[Intel Gaudi] #5. FLUX.1 on Gaudi-2
This article discusses inference efficiency when running the FLUX.1 models on Intel Gaudi-2 hardware.
Apr 02, 2025
Introduction

SqueezeBits recently introduced a Gaudi-2 node for internal research and benchmarking, enabling detailed performance analysis across diverse AI workloads. In this post, we share initial results and insights from our evaluations, focusing on whether Intel Gaudi can effectively support workloads beyond large language models (LLMs).
As detailed in previous entries of our Intel Gaudi blog series, the Gaudi architecture requires computational graphs to be compiled ahead of time, which poses inherent challenges for dynamic inference workloads such as LLMs. Although recent optimizations have significantly improved Gaudi-2’s performance on such workloads, the architecture remains best suited to workloads with static tensor shapes.
Given this context, we investigated Gaudi-2's capabilities using the FLUX.1 family of image generation models developed by Black Forest Labs, which are particularly well-suited to static computational graphs. Our evaluations have revealed noteworthy performance and efficiency gains on Gaudi-2.
Overview of FLUX.1
![Figure 2. MM-DiT and Rectified Flow Transformers, which deeply influenced the FLUX.1 model family. [source]](https://image.inblog.dev?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fattachment%253Aa6a57377-f67c-4a36-9787-4bd1442dfa36%253Aimage.png%3Ftable%3Dblock%26id%3D1c9258ac-0943-80d6-9a26-c45a60b8ec30%26cache%3Dv2&w=2048&q=75)
FLUX.1, created by Black Forest Labs, is an innovative family of diffusion-based image generation models reminiscent of Stable Diffusion but with distinct architectural enhancements. Key components of FLUX.1 include multimodal transformer backbones (MM-DiT), parallel attention layers, and parallel diffusion transformer blocks, which together improve generation speed and efficiency. Crucially, FLUX.1 operates on fixed-shape latent tensors throughout the diffusion process, aligning ideally with Gaudi-2’s static-graph optimization strategy.
The FLUX.1 architecture leverages hybrid transformer blocks, combining multimodal diffusion transformer blocks with parallel diffusion transformer blocks, and is scaled to 12 billion parameters. Notable design choices, such as flow matching and rotary positional embeddings, boost both visual detail and computational performance. Additionally, guidance distillation applied during training further improves inference efficiency without sacrificing image quality.
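To make the MM-DiT idea more concrete, below is a heavily simplified, self-contained PyTorch sketch of a double-stream block: image and text tokens keep separate projections and MLPs but attend jointly over the concatenated sequence. This is a toy illustration only, not FLUX.1’s actual code; it omits adaLN timestep modulation, rotary embeddings, and the real model dimensions, and all class and argument names are ours.

```python
# Toy sketch of an MM-DiT-style double-stream block (not FLUX.1's actual code):
# per-modality projections and MLPs, joint attention over both token streams.
# AdaLN timestep modulation and rotary embeddings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDoubleStreamBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_out = nn.Linear(dim, dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        b, n_img, d = img.shape
        n_txt = txt.shape[1]
        # Per-modality QKV projections, concatenated into one joint sequence.
        qkv = torch.cat([self.txt_qkv(self.norm(txt)), self.img_qkv(self.norm(img))], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)
        split = lambda t: t.view(b, -1, self.heads, d // self.heads).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(b, n_txt + n_img, d)
        # Route the joint attention output back to each stream with residuals.
        txt = txt + self.txt_out(attn[:, :n_txt])
        img = img + self.img_out(attn[:, n_txt:])
        img = img + self.img_mlp(self.norm(img))
        txt = txt + self.txt_mlp(self.norm(txt))
        return img, txt

# Example: 4096 image tokens (roughly a 1024x1024 image) and 512 text tokens.
block = ToyDoubleStreamBlock()
img, txt = block(torch.randn(1, 4096, 256), torch.randn(1, 512, 256))
```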
Currently, the FLUX.1 model family includes three distinct versions tailored to different user segments: FLUX.1 [pro] for professional and commercial applications, FLUX.1 [dev] for research and non-commercial experimentation, and FLUX.1 [schnell], optimized for fast, local prototyping. Collectively, these variants showcase significant advancements in text-to-image generation, blending robust architectural designs with practical usability.
FLUX.1 Inference Performance on Intel Gaudi-2
Intel’s optimum-habana framework currently supports FLUX.1, providing optimized inference pipelines tailored for Gaudi hardware. Using this framework, we deployed FLUX.1 [dev] and FLUX.1 [schnell] on our Gaudi-2 node and ran a series of benchmarks of their inference performance; a minimal setup sketch follows the configuration list below.
Benchmark Setup:
- Model: FLUX.1 [dev] and FLUX.1 [schnell] (BF16)
- Device: Single card
- Sampling Steps: 28 (for [dev]), 4 (for [schnell])
- Sampler: FlowMatchEulerDiscrete
- Image Resolution: 256x256 to 1024x1024
- Batch Size: 1 to 8
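To make the setup concrete, here is a minimal sketch of how such a run can be launched with optimum-habana. It assumes the library exposes a Gaudi Flux pipeline analogous to its other Gaudi* pipelines; the class and argument names may differ across versions, and the prompt is only an example.

```python
# Minimal sketch, assuming optimum-habana exposes a Gaudi Flux pipeline
# analogous to its other Gaudi* pipelines (names may differ by version).
import torch
from optimum.habana.diffusers import GaudiFluxPipeline  # assumed class name

pipe = GaudiFluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_habana=True,                          # run on the HPU
    use_hpu_graphs=True,                      # reuse pre-compiled static graphs
    gaudi_config="Habana/stable-diffusion",   # Gaudi mixed-precision config
)

images = pipe(
    prompt="An orange holding a sign saying SqueezeBits",
    height=1024,
    width=1024,
    num_inference_steps=28,                   # 4 for FLUX.1 [schnell]
    num_images_per_prompt=1,
).images
```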
Impact of Batch Size
![Figure 3. Generation latency and throughput of FLUX.1 [dev] and FLUX.1 [schnell] on Gaudi-2 for various batch sizes.](https://image.inblog.dev?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fattachment%253Aa03869f8-69da-4023-a5cb-30e2e1b11c15%253Aimage.png%3Ftable%3Dblock%26id%3D1c9258ac-0943-80c8-a3b3-dd3b804ca9fb%26cache%3Dv2&w=2048&q=75)
To assess how batch size influences throughput and latency, we fixed the resolution at 1024x1024 and increased the batch size from 1 to 8. The results showed only minor throughput improvements (~4.5%) at larger batch sizes. This limited scaling stems from the design of FLUX.1: its matrix multiplications are already large at batch size 1, so Gaudi-2 reaches high compute utilization without batching, and enlarging the batch dimension adds little extra parallelism to exploit.
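A quick back-of-envelope calculation illustrates the point. The figures below are approximate assumptions about FLUX.1 (8x VAE downsampling, 2x2 patchification, a transformer width of 3072), but the order of magnitude is what matters:

```python
# Back-of-envelope only; the FLUX.1 figures below (8x VAE downsampling,
# 2x2 patchification, hidden width 3072) are approximate assumptions.
height = width = 1024
latent_h, latent_w = height // 8, width // 8       # 128 x 128 latent
image_tokens = (latent_h // 2) * (latent_w // 2)    # 4096 tokens after patchification
hidden = 3072

# FLOPs of a single dense projection at batch size 1:
flops = 2 * image_tokens * hidden * hidden
print(f"{image_tokens} tokens, ~{flops / 1e9:.1f} GFLOPs per linear layer")
# -> 4096 tokens, ~77.3 GFLOPs: the GEMMs are already large at batch 1,
#    so growing the batch dimension mostly lengthens an already
#    compute-saturated matmul instead of filling idle units.
```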
Comparing the [schnell] and [dev] variants highlighted significant throughput and latency differences, with [schnell] generating images approximately 7 times faster. This advantage comes almost entirely from the reduced number of sampling steps (4 for [schnell] versus 28 for [dev]), and Gaudi-2 delivered essentially the full 7x speedup implied by that ratio.
Impact of Image Resolution
![Figure 4. Single image generation latency of FLUX.1 [dev] and FLUX.1 [schnell] on Gaudi-2 for various image resolutions.](https://image.inblog.dev?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fattachment%253Ab25d4d40-151f-43fa-bc65-30af992408e3%253Aimage.png%3Ftable%3Dblock%26id%3D1c9258ac-0943-8012-95c1-c9c50495c4a0%26cache%3Dv2&w=2048&q=75)
We analyzed the influence of image resolution on generation latency and throughput by scaling the resolution from 256x256 to 1024x1024. The benchmarks demonstrated a near-linear increase in latency corresponding to the rise in pixel count. Further increasing resolution to 2048x2048 resulted in out-of-memory errors at batch size 4, despite Gaudi-2’s substantial 96GB HBM2e memory. Thus, higher-resolution generation (2048x2048) may necessitate further model optimization, such as FP8 quantization.
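For completeness, here is a hedged sketch of how such a resolution sweep can be timed, reusing the hypothetical `pipe` object from the setup snippet above. Each new resolution triggers a fresh graph compilation on Gaudi, so a warm-up call per resolution keeps compilation time out of the measurement:

```python
# Timing sketch; reuses the hypothetical `pipe` object from the setup snippet.
import time

for res in (256, 512, 768, 1024):
    # Warm-up run so HPU graph compilation for this shape is excluded from timing.
    pipe(prompt="warm-up", height=res, width=res, num_inference_steps=28)
    start = time.perf_counter()
    pipe(prompt="An orange holding a sign saying SqueezeBits",
         height=res, width=res, num_inference_steps=28)
    print(f"{res}x{res}: {time.perf_counter() - start:.2f} s")
```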
Impact of LoRA
![Figure 5. Single image generation latency of FLUX.1 [dev] on Gaudi-2 with and without LoRA.](https://image.inblog.dev?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fattachment%253Ae4773b6b-c52d-4b8a-8734-874ceffcd22d%253Aimage.png%3Ftable%3Dblock%26id%3D1c9258ac-0943-8011-a18c-f00380cdfe26%26cache%3Dv2&w=2048&q=75)
Low-Rank Adaptation (LoRA) is commonly employed to efficiently fine-tune models. We evaluated the impact of incorporating LoRA on inference latency to determine additional computational overhead. Our benchmarks indicated that applying LoRA incurred around a 10% latency overhead for generating a single 1024x1024 image with the FLUX.1 [dev] model. Interestingly, varying the LoRA rank (from rank 16 to 64) showed negligible differences in latency, suggesting rank size minimally influences inference performance.
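For reference, the sketch below shows how a LoRA adapter can be attached through the standard diffusers interface, which we assume the Gaudi pipeline inherits; the adapter repository id is a hypothetical placeholder.

```python
# Hedged sketch; "some-user/flux-style-lora" is a placeholder adapter id, and
# we assume the Gaudi pipeline exposes the standard diffusers LoRA interface.
pipe.load_lora_weights("some-user/flux-style-lora")

image = pipe(
    prompt="An orange holding a sign saying SqueezeBits",
    height=1024,
    width=1024,
    num_inference_steps=28,
).images[0]
```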
Comparison with NVIDIA GPUs
We benchmarked FLUX.1 [dev] model performance on Gaudi-2 against NVIDIA's high-performance GPUs, referencing performance data available here. Results demonstrated Gaudi-2’s impressive performance, notably surpassing NVIDIA’s H100-PCIe GPU under BF16 precision:
Single image (1024x1024) generation latency:
- Gaudi-2 (Denvr Dataworks, $1.25/h): 8.1 seconds
- NVIDIA H100-SXM (Runpod, $2.99/h): 5.6 seconds
- NVIDIA H100-PCIe (Runpod, $2.69/h): 9.9 seconds
- NVIDIA A100-SXM (Runpod, $1.89/h): 11.1 seconds
Gaudi-2 delivered lower latency than NVIDIA's H100-PCIe and A100 GPUs, with only the premium H100-SXM performing faster. Combined with its cost efficiency of roughly $0.0028 per generated image, this makes Gaudi-2 exceptionally attractive for practical deployment scenarios.
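The cost-per-image figures follow directly from the hourly prices and latencies listed above:

```python
# Cost per 1024x1024 image, derived from the hourly prices and latencies above.
configs = {
    "Gaudi-2 (Denvr, $1.25/h)":    (1.25, 8.1),
    "H100-SXM (Runpod, $2.99/h)":  (2.99, 5.6),
    "H100-PCIe (Runpod, $2.69/h)": (2.69, 9.9),
    "A100-SXM (Runpod, $1.89/h)":  (1.89, 11.1),
}
for name, (usd_per_hour, seconds_per_image) in configs.items():
    print(f"{name}: ${usd_per_hour / 3600 * seconds_per_image:.4f} per image")
# Gaudi-2 comes out to roughly $0.0028 per image, the lowest of the four.
```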
Further Improvements Coming with FP8 Quantization
![Figure 6. Image generation Results of BF16 and FP8 FLUX.1 [dev] model on a sample prompt: “An orange holding a sign saying SqueezeBits”.](https://image.inblog.dev?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fattachment%253A7af24405-faaa-4f23-8720-64a721d49c84%253Agaudi-blog5-fig2.png%3Ftable%3Dblock%26id%3D1c9258ac-0943-80ce-a786-fea7a45f4aae%26cache%3Dv2&w=2048&q=75)
As discussed previously in our Intel Gaudi blog series, FP8 quantization support is a key advantage of Gaudi-2. Preliminary evaluations indicate negligible differences in image quality between FP8 and BF16. Since FP8 significantly lowers both memory requirements and computational overhead, it promises further latency reductions and cost-efficiency gains. We are currently expanding our testing to additional outputs to identify any corner cases that might show quality degradation.
Conclusion
Our evaluation of FLUX.1 models on Intel Gaudi-2 clearly demonstrates the hardware's compelling performance and cost advantages for static-tensor AI workloads beyond large language models. With increasing availability of Gaudi-2 and Gaudi-3 across major cloud service providers, leveraging Gaudi hardware represents a significant opportunity to reduce AI inference costs.
P.S.
While this post has focused on image generation models, we also have exciting developments regarding LLMs on Gaudi-2. To showcase Gaudi-2’s capabilities with LLM workloads, we are pleased to announce our Gaudi-2-powered LLM inference endpoint demo—one of the first publicly available Gaudi-based LLM demos. We encourage the community to explore this endpoint and provide feedback.
For comprehensive LLM serving performance evaluations, explore Fits on Chips, our benchmarking platform, which now supports NVIDIA GPUs as well as Intel Gaudi-2 and Gaudi-3. The platform provides unified inference metrics such as throughput and latency, along with popular model evaluation benchmarks including ARC Challenge, enabling direct comparisons of serving efficiency across hardware and serving frameworks. Discover more about Fits on Chips here: