Winning both speed and quality: How Yetter deals with diffusion models
Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable, real-time image generation.
Oct 31, 2025
Contents
- Introduction
- Break Points of Existing Approaches
- Caching: Quality Preserved, Limited Speedup
- Speedup-Quality Tradeoff Analysis
- Distillation: High-Speed, Low-Quality
- Diversity Reduction
- Degraded Quality
- Inability to Handle Negative Prompts
- Solution from the Yetter Inference Engine
- Novel Inference Pipeline for Speed and Quality
- System-Level Optimizations
- Results: Compelling Evidence of a Broken Tradeoff
- Diversity Comparison
- Quality Comparison
- Negative prompt handling
- Latency Benchmark
- Conclusion
Introduction
Diffusion models are now a cornerstone of visual synthesis. Architectural innovations, from early U-Net designs to contemporary Diffusion Transformers (DiT), have driven major gains on visual tasks. Scaling these DiT architectures to tens of billions of parameters, as in FLUX.1 and Qwen-Image, has pushed generation quality to unprecedented levels.
While scaling the models has yielded remarkable performance gains, their computation and memory demands have become severe. Unlike autoregressive language models, which exploit KV caching for efficiency, diffusion models require a full forward pass at every denoising step. This step-by-step sampling is a fundamental bottleneck for real-time or high-throughput use, making acceleration essential. Among the various strategies, the two most commonly used are step caching and model distillation. However, step caching yields only modest speedups, while distillation trades away creativity, diversity, and guidance control for speed.
In this article, we analyze the tradeoffs of previous approaches and introduce the Yetter Inference Engine, which overcomes their core limitations. Through experiments with Qwen-Image, we demonstrate how our system achieves scalable, real-time diffusion inference without compromising generative quality.
Break Points of Existing Approaches
Caching: Quality Preserved, Limited Speedup

A typical diffusion model runs 10~100 sequential denoising steps to coherently update a noisy latent into a refined image. Caching methods, such as First-Block Cache, TaylorSeer, and TeaCache, are an intuitive optimization strategy: they predict the update for the current step from the previous step's cached information. Because they predict the update instead of fully computing it, caching methods significantly reduce the computational workload.
First-block caching is a representative caching method implemented in the Hugging Face diffusers library (see Figure 1). Unlike the baseline, which unconditionally executes all Multi-Modal DiT (MM-DiT) blocks at every denoising step, first-block caching inserts a checkpoint after the initial block and checks whether its residual (output minus input) closely matches that of the previous step. If the difference is smaller than a threshold, the system assumes the direction of the update has barely changed, so it skips the remaining blocks and reuses the cached result for the current step's output. Because the model weights remain unchanged, caching-based methods have the advantage of preserving the capacity of the original model. On the other hand, the speedup of caching methods is limited to about 2~3x in practice if quality is to be maintained, which is modest compared with other acceleration methods.
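To make the decision logic concrete, here is a minimal sketch of the first-block caching idea, assuming the transformer is exposed as a plain list of blocks that each map a latent tensor to a latent tensor. Real MM-DiT blocks also take text embeddings and timestep conditioning, and this is not the diffusers API, just an illustration of the skip-or-compute decision.

```python
import torch

class FirstBlockCache:
    """Minimal sketch of the first-block caching decision (illustrative, not a library API)."""

    def __init__(self, blocks, threshold=0.1):
        self.blocks = blocks
        self.threshold = threshold
        self.prev_first_residual = None   # first-block residual from the previous step
        self.cached_tail_residual = None  # contribution of the remaining blocks from the previous step

    @torch.no_grad()
    def __call__(self, hidden_states):
        first_out = self.blocks[0](hidden_states)
        residual = first_out - hidden_states  # "output minus input" of the first block

        if self.prev_first_residual is not None and self.cached_tail_residual is not None:
            # Relative change of the first-block residual compared with the previous step.
            diff = (residual - self.prev_first_residual).abs().mean()
            rel_change = diff / (self.prev_first_residual.abs().mean() + 1e-8)
            if rel_change < self.threshold:
                # The update direction barely changed: skip the remaining blocks
                # and reuse their cached contribution.
                self.prev_first_residual = residual
                return first_out + self.cached_tail_residual

        # Otherwise run the full block stack and refresh the cache.
        out = first_out
        for block in self.blocks[1:]:
            out = block(out)
        self.cached_tail_residual = out - first_out
        self.prev_first_residual = residual
        return out
```

Raising the threshold skips more steps and yields more speedup, at the cost of the refinement analyzed in the next section.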
Speedup-Quality Tradeoff Analysis

We evaluated the tradeoff between speedup and quality for first-block caching on the Qwen-Image model. As shown in Figure 2, raising the threshold up to a 2x speedup preserves the quality of the original model. However, the image starts to blur when the speedup is pushed beyond 4x, and by 6x the image becomes unusable. Quality degrades because aggressive caching causes the model to skip necessary refinement steps. As a result, caching is safe at modest levels but cannot be pushed to deliver the groundbreaking speed practitioners expect.
Distillation: High-Speed, Low-Quality
While caching methods focus on optimizing inference of the original model, distillation methods compress the denoising process itself through retraining. Distillation methods, such as DMD2, directly reduce the number of denoising steps by training a new student model, which is supervised to mimic the outputs of a full-step teacher model in just one or a few steps. For instance, a popular distilled model, Qwen-Image-Lightning, reduces the base model's 50-step inference to 4~8 steps. To push the speedup further, most distilled models also remove classifier-free guidance (CFG), doubling throughput by eliminating the negative-prompt pass. All things considered, a distilled model can deliver tenfold or even greater speedups for practitioners seeking real-time generation. However, fewer denoising steps inevitably degrade quality. In the following sections, we examine three practical limitations of distilled models, illustrated with Qwen-Image (50-step inference) and Qwen-Image-Lightning (8-step inference).
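As a rough back-of-the-envelope count (ignoring text encoding, VAE decoding, and other overhead), the base model with CFG runs 50 steps x 2 transformer passes = 100 forward passes per image, while an 8-step distilled model without CFG runs 8 x 1 = 8 passes, roughly a 12.5x reduction in transformer work. This is where the "tenfold or greater" figure comes from.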
Diversity Reduction

First, diversity is reduced. Distillation trains the student to match the teacher's output distribution, and during this process the student is encouraged to learn high-probability regions rather than preserve the teacher distribution's full variance. As a result, the original model's ability to generate different outcomes from different initial noise diminishes in the distilled model. As shown in Figure 3, the distilled model produces highly similar images regardless of the input noise or random seed. This loss of stochasticity constrains diversity and limits imaginative outputs.
Degraded Quality

Second, overall image quality decreases. A distilled model collapses the behavior of many denoising steps into a few steps to reach the final state quickly, removing the intermediate refinement stages where fine details are rendered. Consequently, even when metric scores appear similar, human evaluations show a clear gap between the base and distilled models. As shown in Figures 4(a) and 4(c), the distilled model produces excessively intense colors and saturation, resulting in an unnatural, oversaturated image. In Figure 4(b), fine, fur-like textures appear less natural and are rendered as clumps instead of being delicately expressed.
Inability to Handle Negative Prompts

The most significant functional limitation is the inability to handle negative prompts. This matters because most text encoders struggle to interpret negation: a prompt like "Don't think about elephants" is likely to produce an elephant, as the encoder focuses on the token for "elephant". CFG with explicit negative prompts counteracts this by enforcing constraints, making it essential for fine-grained control. However, popular distilled models omit CFG during training to gain an additional 2x speedup and therefore do not accept negative prompts. When a negative prompt is forced at inference, performance degrades sharply. Figure 5 illustrates this clearly: overall quality drops significantly under negative prompts, and in the second example the distilled model fails to follow the instruction, highlighting a fundamental limitation of distilled models for prompt-level guidance.
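For reference, here is a minimal sketch of how CFG combines a positive and an explicit negative prompt at each denoising step. The `model` interface and all names are illustrative assumptions, not any specific library's API; it simply shows why CFG costs two forward passes per step and how the negative prompt enters the prediction.

```python
import torch

def cfg_noise_prediction(model, latents, timestep, prompt_embeds, negative_embeds,
                         guidance_scale=4.0):
    """Classifier-free guidance with an explicit negative prompt (illustrative sketch)."""
    # Two forward passes per step: one conditioned on the negative prompt,
    # one conditioned on the positive prompt, batched together.
    latent_input = torch.cat([latents, latents], dim=0)
    embeds = torch.cat([negative_embeds, prompt_embeds], dim=0)
    noise_neg, noise_pos = model(latent_input, timestep, embeds).chunk(2)

    # Push the prediction away from the negative prompt and toward the positive one.
    return noise_neg + guidance_scale * (noise_pos - noise_neg)
```

Dropping CFG removes the second pass (and the negative-prompt lever along with it), which is exactly the 2x shortcut most distilled models take.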
Solution from the Yetter Inference Engine
Novel Inference Pipeline for Speed and Quality
The two prior approaches have critical limitations: the caching-based method offers only modest speedups, while distillation constrains the original model's capabilities. To overcome these limitations, we implemented a novel inference pipeline in the Yetter Inference Engine that delivers a breakthrough speedup while preserving quality.
The core idea behind the engine stems from a simple experimental observation. The model's output diversity and the effect of CFG are highly dependent on the early, high-noise steps of the diffusion process. By contrast, distilled models are efficient in the final, low-noise steps, when they operate on an already well-formed latent.

Based on this insight, we developed a novel inference engine that systematically overcomes prior limitations by utilizing the best-suited model for each stage. As shown in Figure 6, the Yetter Inference Engine's multi-stage pipeline begins with the base model, which seeds generation with high diversity and correctly reflects negative prompts in the latent space. The latent representation is then passed to a bridging model that connects the base model's intermediate output to the distilled model, aligning the update flow. Finally, the distilled model completes the process, generating the final image from the well-formed latent in just a few steps. Throughout the pipeline, the engine applies stage-specific caching to further accelerate execution. This pipeline in the Yetter Inference Engine enables remarkable gains in both speed and quality.
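A conceptual sketch of this staged denoising loop is shown below. The model and scheduler interfaces, the `switch_step` boundary, and the `bridge` call are hypothetical stand-ins (the actual stage boundaries, bridging model, and caching live inside the engine); it reuses the `cfg_noise_prediction` helper sketched earlier.

```python
import torch

@torch.no_grad()
def hybrid_denoise(base_model, bridge, distilled_model,
                   scheduler_base, scheduler_fast,
                   latents, prompt_embeds, negative_embeds, switch_step=6):
    """Conceptual sketch of the staged pipeline: base -> bridge -> distilled."""
    # Stage 1: base model with CFG seeds diversity and applies the negative prompt
    # during the early, high-noise steps.
    for t in scheduler_base.timesteps[:switch_step]:
        noise_pred = cfg_noise_prediction(base_model, latents, t,
                                          prompt_embeds, negative_embeds)
        latents = scheduler_base.step(noise_pred, t, latents).prev_sample

    # Stage 2: bridging model aligns the intermediate latent with the
    # distilled model's update flow.
    latents = bridge(latents)

    # Stage 3: distilled model (no CFG) finishes from the well-formed latent
    # in a handful of low-noise steps.
    for t in scheduler_fast.timesteps:
        noise_pred = distilled_model(latents, t, prompt_embeds)
        latents = scheduler_fast.step(noise_pred, t, latents).prev_sample
    return latents
```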
System-Level Optimizations
Beyond the novel pipeline design at the algorithmic level, the Yetter Inference Engine provides three key capabilities for efficiently scaling AI systems:
- Model Optimization
- Pipeline Optimization
- Graph Compilation
Specifically, the engine reduces model size and latency via quantization and pruning to accelerate diffusion workloads. At the pipeline level, it efficiently schedules models and shares weights between the base and distilled variants; lightweight LoRA adapters capture the residual differences and are dynamically injected at inference time. The engine also runs inference with compiled computation graphs and makes hardware-aware architectural adjustments. These refinements resolve compilation issues and enable the selection of optimal kernels for additional speedups.
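To illustrate the weight-sharing idea, here is a minimal, hypothetical sketch (not the engine's actual implementation) of a linear layer whose weights are shared by the base and distilled variants, with the distilled behavior expressed as a toggleable low-rank delta:

```python
import torch
import torch.nn as nn

class SharedLinearWithLoRA(nn.Module):
    """Illustrative sketch: one frozen base weight, plus a low-rank delta that
    switches the layer between base and distilled behavior per pipeline stage."""

    def __init__(self, base_linear: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base_linear.requires_grad_(False)  # single shared copy of the weights
        out_features, in_features = base_linear.weight.shape
        self.lora_down = nn.Linear(in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)  # delta starts at zero: identical to the base
        self.use_distilled_delta = False     # flipped when the distilled stage runs

    def forward(self, x):
        out = self.base(x)
        if self.use_distilled_delta:
            # Add the low-rank residual that captures the distilled variant.
            out = out + self.lora_up(self.lora_down(x))
        return out
```

Because only the small `lora_down`/`lora_up` matrices differ between variants, both stages can live in memory at once without duplicating the full model weights.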
All optimizations in Yetter Inference Engine have been validated across diverse hardware, including NVIDIA GPUs, AMD GPUs, Intel Gaudi, Google TPUs, and more. This integrated, hardware-aware approach preserves performance while accelerating end-to-end inference across heterogeneous devices.
Results: Compelling Evidence of a Broken Tradeoff
Yetter Inference Engine is designed to avoid performance compromises, achieving the speed of distillation while preserving the base model's full feature set. Our experiments confirm its success from both perspectives.
Diversity Comparison

First, the engine preserves the randomness of the original model. As shown in Figure 7, sample diversity from the Yetter Inference Engine is significantly higher than that of the standalone distilled model: the character's poses and colors vary markedly across seeds. This indicates that the Yetter Inference Engine successfully manages the early stages with the base model and preserves the stochastic creativity fundamental to diffusion models.
Quality Comparison

Second, the engine preserves the fine-grained quality of the original model. As illustrated in Figure 8, the Yetter Inference Engine delivers exceptionally high visual quality comparable to the base model’s output, in sharp contrast to the degraded, oversaturated, and clumped artifacts (e.g., fur textures) produced by the standalone distilled model (shown in Figure 4). This demonstrates that the Yetter Inference Engine maintains the base model’s expressive capacity.
Negative prompt handling

Third, the engine fully supports and correctly handles negative prompts. Figure 9 shows that the Yetter Inference Engine processes negative prompt guidance reliably, enabling fine-grained user control over the output. This results from using the base model's CFG-aware architecture in the early steps, a capability completely lost in the distilled model's workflow (see Figure 5). This feature is critical for precise, user-guided image generation, as it offers the most reliable way to handle negations that text encoders often misinterpret.
Latency Benchmark

Finally, and most importantly, Yetter Inference Engine delivers all of these benefits while matching the speed of distilled models that compromise quality. The full comparison appears in Table 1.
The Cached Model preserves all performance characteristics of the Base model but only halves latency, reaching 7.16 s on NVIDIA H100 and 14.70 s on Intel Gaudi 2. The Distilled Model shows a superior 11x speedup, achieving 1.35 s (H100) and 2.76 s (Gaudi 2). However, it suffers from reduced diversity and quality and does not support negative prompts.
Against this backdrop, the results for the Yetter Inference Engine are definitive. The engine achieves 1.57 s on the H100 and 3.07 s on the Gaudi 2. This represents a compelling 10x speedup over the base model, far exceeding the 2x gain from caching, while preserving diversity and quality and fully supporting negative prompts.
Conclusion
Yetter Inference Engine breaks the speed-quality tradeoff that has constrained large-scale diffusion model deployments. By orchestrating a hybrid pipeline of base, bridging, and distilled models, and applying system-level optimizations, the Yetter Inference Engine achieves a 10x speedup while preserving all the features of the original base model. Moreover, it is broadly applicable across diverse hardware, offering a practical and powerful solution for serving state-of-the-art models at scale in robust production settings.

If you’re interested, you can try the Yetter Inference Engine via API or on the official website at $0.01 per megapixel. We welcome feedback, benchmarks, and contributions, and we are open to collaboration, so feel free to reach out. This is just the beginning; keep an eye on our updates and ongoing development.