Bringing NPUs into Production: Our Journey with Intel Gaudi
SqueezeBits partnered with Intel to optimize Gaudi-2 for generative AI workloads, from LLM serving with vLLM to image generation. The result is yetter: faster, cheaper inference for production AI.
Jul 01, 2025
AI Accelerator Meets Real-world Challenges
Neural Processing Units (NPUs) are purpose-built for AI workloads. In theory, they offer powerful compute, excellent power efficiency, and high memory bandwidth, making them ideal candidates for modern AI services. In practice, however, developers often encounter substantial obstacles: many NPUs have immature software stacks, insufficient documentation, and poor integration with widely used open-source frameworks. As a result, promising hardware rarely translates smoothly into real-world productivity, which significantly limits adoption.
We recognized this usability gap as an opportunity rather than a barrier. Instead of waiting for the ecosystem to mature, we proactively partnered with Intel to turn Intel Gaudi-2 into a production-ready platform. We see ourselves as more than Gaudi users; we aim to be enablers and pioneers, helping more companies leverage Gaudi to deploy AI services efficiently.
Optimizing Gaudi from the Ground Up
Last year, we began collaborating closely with Intel to enhance Gaudi-2’s software stack. Our initial efforts focused on rigorous benchmarking on Intel's SynapseAI stack, followed by targeted optimizations such as FP8 quantization, which greatly improved inference throughput while maintaining accuracy. (Check out our benchmark analysis and FP8 quantization posts.)
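To make the core idea concrete, here is a minimal PyTorch sketch of per-tensor FP8 (E4M3) quantization. It illustrates the general technique, not our Gaudi-specific implementation; the scaling recipe and the test tensor are assumptions chosen for the example.

```python
import torch

# Per-tensor FP8 (E4M3) quantization sketch -- illustrative only,
# not the production Gaudi kernel path.
E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale the tensor into FP8 range and cast; return values and scale."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor."""
    return x_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)  # stand-in for a weight matrix
w_fp8, s = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, s) - w).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```

Halving the bytes per weight relative to FP16 is what drives the throughput gains: more of the model stays resident in memory, and every matrix multiply moves half as much data.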
A key focus was improving the developer experience by integrating Gaudi with popular open-source frameworks. Working with Intel and NAVER Cloud, we optimized Gaudi’s support for vLLM, a widely used inference framework for large language models, achieving roughly an 8.5x throughput improvement and an 8.3x reduction in TPOT (time per output token). We also implemented practical features such as structured output generation, Multi-LoRA (sketched below), and FP8 quantization, significantly easing integration into real-world AI applications.
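To give a sense of what Multi-LoRA support looks like to a user, here is a minimal sketch against vLLM's public offline Python API. The model name and adapter paths are placeholders, and we assume a vLLM build with the Gaudi (HPU) backend installed so that device selection happens automatically.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model, multiple LoRA adapters served side by side.
# Model and adapter paths are placeholders, not our deployment.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=4,  # adapters kept resident simultaneously
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names its own adapter; vLLM batches across adapters.
summary = llm.generate(
    "Summarize: Gaudi-2 brings NPUs into production.",
    params,
    lora_request=LoRARequest("summarizer", 1, "/path/to/summarizer_adapter"),
)
translation = llm.generate(
    "Translate to Korean: Hello, Gaudi.",
    params,
    lora_request=LoRARequest("translator", 2, "/path/to/translator_adapter"),
)
print(summary[0].outputs[0].text)
```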
Our collaboration was intensive but highly rewarding. At first, reference materials were hard to find, and we frequently sought support from Intel engineers. Intel's response was remarkably proactive: their engineers even traveled from Israel and spent a full week onsite with us in Korea to solve these challenges together.
Expanding Beyond Language Models
Our work with Intel did not stop at optimizing large language models. With NAVER Labs, we optimized their advanced 3D reconstruction model for Gaudi, achieving roughly a 13-fold improvement in throughput compared to the baseline. Additionally, we optimized image and video generation models for Gaudi, demonstrating significant efficiency gains in real-world deployments.
Throughout this extensive process, Gaudi’s advantage over other NPUs became clear. Among the accelerators we have tested, Gaudi stands out as the most mature and developer-friendly: it integrates smoothly with widely used open-source frameworks such as Hugging Face, vLLM, and DeepSpeed, significantly lowering adoption barriers in real-world services. The sketch below shows how little code that integration typically requires.
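For example, running an off-the-shelf Hugging Face model on Gaudi largely amounts to targeting the "hpu" device. This is a minimal sketch assuming the Habana PyTorch bridge (habana_frameworks) is installed; the model choice is a placeholder.

```python
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any Hugging Face causal LM loads the same way.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to("hpu").eval()

inputs = tok("NPUs in production:", return_tensors="pt").to("hpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
htcore.mark_step()  # flush queued ops in Gaudi's lazy execution mode
print(tok.decode(out[0], skip_special_tokens=True))
```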
From Research to Commercial Service: Launching yetter
Leveraging our optimization experience, we created yetter, a generative AI API service powered by Gaudi-2. Built on our optimized inference stack, yetter generates high-quality images at low latency and at substantially lower cost than GPU-based alternatives; in direct benchmarks, it delivered up to a tenfold reduction in per-image generation cost without compromising quality or speed.
yetter reflects our strength in model compression and hardware-level optimization. These capabilities directly translate to lower infrastructure costs, making high-quality generative AI services more accessible.

SqueezeBits at Intel AI Summit 2025
We presented our outcomes at Intel AI Summit 2025 on July 1 in Seoul. Our CEO, Hyungjun Kim, joined the speaker session to share practical insights for teams adopting Intel Gaudi. He walked through real-world examples of compressing and optimizing various generative AI workloads on the platform.
At our booth, we demonstrated yetter running on Gaudi-2 and shared our optimization journey, from compiler-level improvements to deployment strategies.
What we’ve accomplished with Gaudi is just the beginning. We are excited to apply these learnings to other NPU platforms, helping more AI teams harness specialized hardware with real impact. With the right optimizations, we believe every NPU can have its day.

📚 Explore our Intel Gaudi blog series.