Winning both speed and quality: How Yetter deals with diffusion models

Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable, real-time image generation.
Yeonjoon Jung
Oct 31, 2025

Introduction

Diffusion models are now a cornerstone of visual synthesis. Architectural innovations, from early U-Net designs to contemporary Diffusion Transformers (DiT), have driven major gains on visual tasks. Scaling these DiT architectures to tens of billions of parameters, as in FLUX.1 and Qwen-Image, has pushed generation quality to unprecedented levels.
While scaling the models has yielded remarkable performance gains, the computational and memory demands have become severe. Unlike autoregressive language models, which exploit KV caching for efficiency, diffusion models require full computation at every denoising step. This step-by-step sampling is a fundamental bottleneck for real-time or high-throughput use, making acceleration essential. Among the various strategies, the two most commonly used are step caching and model distillation. However, step caching yields only modest speedups, while distillation trades away creativity, diversity, and guidance control for speed.
In this article, we analyze the tradeoffs of these approaches and introduce the Yetter Inference Engine, which overcomes their core limitations. Through experiments with Qwen-Image, we demonstrate how our system achieves scalable, real-time diffusion inference without compromising generative quality.

Break Points of Existing Approaches

Caching: Quality Preserved, Limited Speedup

Figure 1. Illustration of the first-block caching method. MM-DiT refers to multi-modal diffusion transformers, a variant of DiT that processes text and image tokens together in a single transformer.
A typical diffusion model requires 10~100 sequential denoising steps to coherently update a noisy latent into a refined image. Caching methods, such as First-Block Cache, TaylorSeer, and TeaCache, are an intuitive optimization strategy: they predict the current step's update from the previous step's cached information. Because they predict the update instead of fully computing it, these methods significantly reduce the computational workload.
First-block caching is a representative caching method implemented in the Hugging Face diffusers library (see Figure 1). Unlike the baseline, which unconditionally executes all N Multi-Modal DiT (MM-DiT) blocks at each denoising step, first-block caching inserts a checkpoint after the initial block and checks whether its residual (output − input) closely matches that of the previous step. If the difference is smaller than a threshold, the system assumes the direction of the update has barely changed, skips the remaining N−1 blocks, and reuses the cached latent as the current step's output. Since the model weights remain unchanged, caching-based methods have the advantage of preserving the capacity of the original model. On the other hand, the speedup attainable while maintaining quality is limited to roughly 2~3x in practice, which is modest compared with other acceleration methods.
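To make the mechanism concrete, here is a minimal PyTorch-style sketch of the first-block caching decision. The class structure, block interface, and threshold logic are illustrative assumptions rather than the actual diffusers implementation.

```python
import torch

class FirstBlockCachedDiT(torch.nn.Module):
    """Sketch: skip blocks 1..N-1 when the first block's residual barely
    changes between denoising steps (illustrative, not the diffusers code)."""

    def __init__(self, blocks, threshold=0.1):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)  # N MM-DiT-style blocks
        self.threshold = threshold
        self.prev_first_residual = None            # first-block residual from last step
        self.prev_tail_residual = None             # cached contribution of blocks 1..N-1

    def forward(self, hidden_states):
        # Always run the first block and measure its residual (output - input).
        first_out = self.blocks[0](hidden_states)
        first_residual = first_out - hidden_states

        if self.prev_first_residual is not None:
            # Relative change of the first-block residual vs. the previous step.
            diff = (first_residual - self.prev_first_residual).abs().mean()
            scale = self.prev_first_residual.abs().mean() + 1e-8
            if diff / scale < self.threshold:
                # Update direction barely changed: reuse the cached tail residual
                # instead of executing the remaining N-1 blocks.
                self.prev_first_residual = first_residual
                return first_out + self.prev_tail_residual

        # Otherwise, run the remaining blocks and refresh the cache.
        out = first_out
        for block in self.blocks[1:]:
            out = block(out)
        self.prev_first_residual = first_residual
        self.prev_tail_residual = out - first_out
        return out
```

Raising the threshold makes cache hits more frequent, which is exactly the speed-quality dial examined in the next section.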

Speedup-Quality Tradeoff Analysis

Figure 2. The speedup vs. quality tradeoff when using first-block caching method on Qwen-Image.
We evaluated the tradeoff between speedup and quality of the first-block caching method on the Qwen-Image model. As shown in Figure 2, raising the threshold to reach a 2x speedup preserves the quality of the original model. However, the image starts to blur when the speedup is pushed beyond 4x, and by 6x the image becomes unusable. Quality degrades because aggressive caching causes the model to skip necessary refinement steps. As a result, caching methods are safe at modest levels but cannot be pushed to deliver the groundbreaking speedups practitioners expect.

Distillation: High-Speed, Low-Quality

While caching methods optimize inference of the original model as-is, distillation methods compress the denoising process itself through retraining. Distillation methods, such as DMD2, directly reduce the number of denoising steps by training a new student model, which is supervised to mimic the outputs of a full-step teacher model in just one or a few steps. For instance, a popular distilled model, Qwen-Image-Lightning, reduces the base model's 50-step inference to 4~8 steps. To push the speedup further, most distilled models also remove classifier-free guidance (CFG), doubling throughput by eliminating the negative-prompt pass. All things considered, a distilled model can deliver tenfold or even greater speedups for practitioners seeking real-time generation. However, fewer denoising steps inevitably degrade quality. In the following sections, we examine three practical limitations of distilled models, illustrated with Qwen-Image (50-step inference) and Qwen-Image-Lightning (8-step inference).
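To see where the "tenfold or greater" figure comes from, a rough back-of-the-envelope count of transformer forward passes (assuming one pass per step, and two per step when CFG is active) already accounts for most of the gap:

```python
# Rough forward-pass count per image
# (assumption: one transformer pass per step, two when CFG is enabled).
base_passes = 50 * 2       # Qwen-Image: 50 steps with CFG (positive + negative pass)
distilled_passes = 8 * 1   # Qwen-Image-Lightning: 8 steps, CFG removed

print(base_passes / distilled_passes)  # 12.5x fewer forward passes
```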

Diversity Reduction

Figure 3. Diversity analysis of the distilled model, showing its failure to generate diverse images from different initial noise settings.
First, diversity is reduced. Distillation trains the student to match the teacher's output distribution. During this process, the student is encouraged to learn high-probability regions rather than preserve the full variance of the teacher's distribution. As a result, the original model's ability to generate different outcomes from different initial noise diminishes in the distilled model. As shown in Figure 3, the distilled model produces highly similar images regardless of the input noise or random seed. This loss of stochasticity constrains diversity and limits imaginative outputs.

Degraded Quality

Figure 4. Quality comparison between the Base Model and Distilled Model. The distilled model shows degraded performance, particularly in natural color saturation and fine texture rendering.
Second, overall image quality decreases. A distilled model collapses the behavior of many denoising steps into a few steps to reach the final state quickly. This removes the intermediate refinement stages where fine details are rendered. Consequently, even when metric scores appear similar, human evaluations show a clear gap between the base and the distilled model. As shown in Figures 4(a) and 4(c), the distilled model produces excessively intense colors and saturation, resulting in an unnatural, oversaturated image. In Figure 4(b), fine, fur-like textures appear less natural and are rendered as clumps instead of being delicately expressed.

Inability to Handle Negative Prompts

Figure 5. The impact of negative prompts on the distilled model. The output quality drops significantly, and the model shows low fidelity to the negative prompt's guidance.
The most significant functional limitation is the inability to handle negative prompts. This matters because most text encoders struggle to interpret negation. For example, a prompt like "Don’t think about elephants" is likely to produce an elephant, as the encoder focuses on the token for elephant. CFG with explicit negative prompts counteracts this by enforcing constraints, making it essential for fine-grained control. However, popular distilled models omit CFG during training to gain an additional 2x speedup and therefore do not accept negative prompts. When a negative prompt is forced at inference, performance degrades sharply. Figure 5 illustrates this clearly: overall quality drops significantly under negative prompts, and in the second example the distilled model fails to follow the instruction, highlighting a fundamental limitation of distilled models for prompt-level guidance.
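For context, classifier-free guidance combines a positive-prompt prediction and a negative-prompt (or unconditional) prediction at every step, which is exactly why dropping it halves the per-step cost. A minimal sketch of the standard CFG update:

```python
import torch

def cfg_update(eps_pos: torch.Tensor, eps_neg: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: steer the prediction toward the positive
    prompt and away from the negative prompt. eps_pos and eps_neg come from two
    separate forward passes of the same model, doubling per-step compute."""
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```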

Solution from the Yetter Inference Engine

Novel Inference Pipeline for Speed and Quality

The two prior approaches have critical limitations: the caching-based method offers only a modest speedup, while the distillation method constrains the original model’s capabilities. To overcome these limitations, we implemented a novel inference pipeline in the Yetter Inference Engine that delivers a breakthrough speedup while preserving quality.
The core idea behind the engine stems from a simple experimental observation. The model's output diversity and the effect of CFG are highly dependent on the early, high-noise steps of the diffusion process. By contrast, distilled models are efficient in the final, low-noise steps, when they operate on an already well-formed latent.
Figure 6. Illustration of the inference pipeline architecture for diffusion models in Yetter Inference Engine. It is composed of a Base Model, an optional Bridging Model, and a Distilled Model.
Based on this insight, we developed a novel inference engine that systematically overcomes prior limitations by using the best-suited model for each stage. As shown in Figure 6, the Yetter Inference Engine’s multi-stage pipeline begins with the base model, which seeds generation with high diversity and correctly reflects negative prompts in the latent space. The latent representation is then passed to a bridging model that connects the base model’s intermediate output to the distilled model, aligning the update flow. Finally, the distilled model completes the process, generating the final image from the well-formed latent in just a few steps. Throughout the pipeline, the engine applies stage-specific caching to further accelerate execution. This pipeline enables remarkable gains in both speed and quality.
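A high-level sketch of what such a staged schedule could look like is shown below. The callables base_step, bridge_step, and distilled_step, the step split, and the scheduler update are illustrative placeholders, not the engine's actual API.

```python
def staged_denoise(latent, prompt_emb, neg_emb, timesteps,
                   base_step, bridge_step, distilled_step,
                   n_base=6, n_bridge=2, guidance_scale=4.0):
    """Early high-noise steps: CFG-aware base model (diversity, negative prompts).
    Middle steps: bridging model that aligns the latent with the distilled model.
    Final steps: distilled model finishes quickly on a well-formed latent."""
    for i, t in enumerate(timesteps):
        if i < n_base:
            # Base model with CFG: preserves diversity and negative-prompt control.
            eps_pos = base_step(latent, t, prompt_emb)
            eps_neg = base_step(latent, t, neg_emb)
            eps = eps_neg + guidance_scale * (eps_pos - eps_neg)
        elif i < n_base + n_bridge:
            # Bridging model aligns the base model's output with the distilled model.
            eps = bridge_step(latent, t, prompt_emb)
        else:
            # Distilled model completes generation in a few steps, without CFG.
            eps = distilled_step(latent, t, prompt_emb)
        latent = latent - eps  # simplified stand-in for the actual scheduler update
    return latent
```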

System-Level Optimizations

Beyond the novel pipeline design at the algorithmic level, the Yetter Inference Engine provides three key capabilities for efficiently scaling AI systems:
  • Model Optimization
  • Pipeline Optimization
  • Graph Compilation
Specifically, the engine reduces model size and latency via quantization and pruning to accelerate diffusion workloads. At the pipeline level, it efficiently schedules models and shares weights between the base and distilled variants; lightweight LoRA adapters capture the residual differences and are dynamically injected at inference time. The engine also runs inference with compiled computation graphs and makes hardware-aware architectural adjustments. These refinements resolve compilation issues and enable the selection of optimal kernels for additional speedups.
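To illustrate the weight-sharing idea (a sketch under simplified assumptions, not the engine's implementation): the distilled variant can be represented as the shared base weights plus a low-rank LoRA residual that is switched on only when the distilled behavior is needed.

```python
import torch

class SharedLinearWithLoRA(torch.nn.Module):
    """Sketch: one shared base projection, with the distilled variant expressed
    as a low-rank (LoRA) residual that can be toggled on at inference time."""

    def __init__(self, base_linear: torch.nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base_linear                    # weights shared with the base model
        out_features, in_features = base_linear.weight.shape
        self.lora_down = torch.nn.Linear(in_features, rank, bias=False)
        self.lora_up = torch.nn.Linear(rank, out_features, bias=False)
        torch.nn.init.zeros_(self.lora_up.weight)  # starts as an exact copy of the base
        self.use_lora = False                      # False: base behavior, True: distilled

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        if self.use_lora:
            out = out + self.lora_up(self.lora_down(x))
        return out
```

Because only the small LoRA matrices differ between variants, the large base weights stay resident once and the switch between stages is cheap.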
All optimizations in Yetter Inference Engine have been validated across diverse hardware, including NVIDIA GPUs, AMD GPUs, Intel Gaudi, Google TPUs, and more. This integrated, hardware-aware approach preserves performance while accelerating end-to-end inference across heterogeneous devices.

Results: Compelling Evidence of a Broken Tradeoff

Yetter Inference Engine is designed to avoid performance compromises, achieving the speed of distillation while preserving the base model’s full feature set. Our experiments confirm its success on both fronts.

Diversity Comparison

Figure 7. Diversity analysis of the Yetter Inference Engine. The engine successfully generates diverse images from different initial noise settings.
First, the engine preserves the randomness of the original model. As shown in Figure 7, sample diversity from the Yetter Inference Engine is significantly higher than that of the standalone distilled model: the character’s poses and colors vary markedly across seeds. This indicates that the Yetter Inference Engine successfully handles the early stages with the base model and preserves the stochastic creativity fundamental to diffusion models.

Quality Comparison

Figure 8. Quality comparison between the Distilled Model and the Yetter Inference Engine. The Yetter Inference Engine shows superior performance, producing naturally expressed images comparable to the Base Model.
Second, the engine preserves the fine-grained quality of the original model. As illustrated in Figure 8, the Yetter Inference Engine delivers exceptionally high visual quality comparable to the base model’s output, in sharp contrast to the degraded, oversaturated, and clumped artifacts (e.g., fur textures) produced by the standalone distilled model (shown in Figure 4). This demonstrates that the Yetter Inference Engine maintains the base model’s expressive capacity.

Negative prompt handling

Figure 9. The impact of negative prompts on the Yetter Inference Engine. The engine preserves high quality and maintains high fidelity to the negative prompt's guidance.
Third, the engine fully supports and correctly handles negative prompts. Figure 9 shows that the Yetter Inference Engine processes negative-prompt guidance reliably, enabling fine-grained user control over the output. This results from using the base model's CFG-aware architecture in the early steps, a capability completely lost in the distilled model's workflow (see Figure 5). This feature is critical for precise, user-guided image generation, as it offers the most reliable way to handle negations that text encoders often misinterpret.

Latency Benchmark

Table 1. A latency and performance comparison of the Base, Cached, Distilled, and Yetter Inference Engine models on a single NVIDIA H100 and a single Intel Gaudi 2. All models were compiled with torch.compile for efficiency.
Finally, and most importantly, Yetter Inference Engine delivers all of these benefits while matching the speed of distilled models that compromise quality. The full comparison appears in Table 1.
The Cached Model preserves all performance characteristics of the Base model but only halves latency, reaching 7.16 s on NVIDIA H100 and 14.70 s on Intel Gaudi 2. The Distilled Model shows a superior 11x speedup, achieving 1.35 s (H100) and 2.76 s (Gaudi 2). However, it suffers from reduced diversity and quality and does not support negative prompts.
Against this backdrop, the results for the Yetter Inference Engine are definitive. The engine achieves 1.57 s on the H100 and 3.07 s on the Gaudi 2. This represents a compelling 10x speedup over the base model, far exceeding the 2x gain from caching, while preserving diversity and quality and fully supporting negative prompts.
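For reference, the kind of harness behind the "compiled with torch.compile" note looks roughly like the sketch below. The model id, the .transformer attribute, and the call arguments are assumptions that may differ across diffusers versions; treat this as illustrative rather than our exact benchmark script.

```python
import time
import torch
from diffusers import DiffusionPipeline

# Load the base model and compile its DiT backbone (attribute name assumed).
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer = torch.compile(pipe.transformer)

# Warm-up run so one-time compilation cost is excluded from the measurement.
pipe("a corgi surfing a wave", num_inference_steps=50)

torch.cuda.synchronize()
start = time.perf_counter()
pipe("a corgi surfing a wave", num_inference_steps=50)
torch.cuda.synchronize()
print(f"end-to-end latency: {time.perf_counter() - start:.2f} s")
```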

Conclusion

Yetter Inference Engine breaks the speed-quality tradeoff that has constrained large-scale diffusion model deployments. By orchestrating a hybrid pipeline of base, bridging, and distilled models, and applying system-level optimizations, it achieves a 10x speedup while preserving all the features of the original base model. Moreover, it is broadly applicable across diverse hardware, offering a practical and powerful solution for serving state-of-the-art models at scale in robust production settings.
If you’re interested, you can try the Yetter Inference Engine via API or on the official website at $0.01 per megapixel. We welcome feedback, benchmarks, and contributions, and we are open to collaboration, so feel free to reach out. This is just the beginning; keep an eye on our updates and ongoing development.
 

SqueezeBits