Bringing NPUs into Production: Our Journey with Intel Gaudi
SqueezeBits partnered with Intel to optimize Gaudi-2 for generative AI workloads, from LLM serving with vLLM to image generation. The result is yetter: faster, cheaper inference for production AI.
Jul 01, 2025
AI Accelerator Meets Real-world Challenges
Neural Processing Units (NPUs) are purpose-built for AI workloads. In theory, they offer powerful compute, excellent power efficiency, and high memory bandwidth, making them ideal candidates for modern AI services. In practice, however, developers often encounter substantial obstacles: many NPUs have immature software stacks, insufficient documentation, and poor integration with widely used open-source frameworks. As a result, promising hardware rarely translates smoothly into real-world productivity, which significantly limits adoption.
We recognized this usability gap as an opportunity rather than a barrier. Instead of waiting for the ecosystem to mature, we proactively partnered with Intel to turn Intel Gaudi-2 into a production-ready platform. We see ourselves as more than Gaudi users; we aim to be enablers and pioneers, helping more companies leverage Gaudi to deploy AI services efficiently.
Optimizing Gaudi from the Ground Up
Last year, we began collaborating closely with Intel to enhance Gaudi-2’s software stack. Our initial efforts focused on rigorous benchmarking on Intel's SynapseAI stack, followed by targeted optimizations such as FP8 quantization, which greatly improved inference throughput while maintaining accuracy. (Check out our benchmark analysis and FP8 quantization posts.)
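To make the core idea concrete, here is a minimal PyTorch sketch of per-tensor FP8 (E4M3) quantization. It illustrates the general technique, not our Gaudi-specific implementation; the scaling recipe and the test tensor are assumptions chosen for the example.

```python
import torch

# Per-tensor FP8 (E4M3) quantization sketch -- illustrative only,
# not the production Gaudi kernel path.
E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def quantize_fp8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Scale the tensor into FP8 range and cast; return values and scale."""
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor."""
    return x_fp8.to(torch.float32) * scale

w = torch.randn(4096, 4096)  # stand-in for a weight matrix
w_fp8, s = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, s) - w).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```

Halving the bytes per weight relative to FP16 is what drives the throughput gains: more of the model stays resident in memory, and every matrix multiply moves half as much data.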
A key focus was improving the developer experience by integrating Gaudi with popular open-source frameworks. Working with Intel and NAVER Cloud, we optimized Gaudi’s support for vLLM, a widely used inference framework for large language models, achieving roughly an 8.5x throughput improvement and an 8.3x reduction in TPOT (time per output token). We also implemented practical features such as structured output generation, Multi-LoRA (sketched below), and FP8 quantization, significantly easing integration into real-world AI applications.
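To give a sense of what Multi-LoRA support looks like to a user, here is a minimal sketch against vLLM's public offline Python API. The model name and adapter paths are placeholders, and we assume a vLLM build with the Gaudi (HPU) backend installed so that device selection happens automatically.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model, multiple LoRA adapters served side by side.
# Model and adapter paths are placeholders, not our deployment.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=4,  # adapters kept resident simultaneously
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Each request names its own adapter; vLLM batches across adapters.
summary = llm.generate(
    "Summarize: Gaudi-2 brings NPUs into production.",
    params,
    lora_request=LoRARequest("summarizer", 1, "/path/to/summarizer_adapter"),
)
translation = llm.generate(
    "Translate to Korean: Hello, Gaudi.",
    params,
    lora_request=LoRARequest("translator", 2, "/path/to/translator_adapter"),
)
print(summary[0].outputs[0].text)
```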
Our collaboration was intensive but highly rewarding. At first, reference materials were hard to find, and we frequently sought support from Intel engineers. Intel's response was remarkably proactive: their engineers even traveled from Israel and spent a full week onsite with us in Korea to solve these challenges together.
Expanding Beyond Language Models
Our work with Intel did not stop at optimizing large language models. With NAVER Labs, we optimized their advanced 3D reconstruction model for Gaudi, achieving roughly a 13-fold improvement in throughput compared to the baseline. Additionally, we optimized image and video generation models for Gaudi, demonstrating significant efficiency gains in real-world deployments.
Throughout this extensive process, Gaudi’s advantage over other NPUs became clear. Among the accelerators we have tested, Gaudi stands out as the most mature and developer-friendly: it integrates smoothly with widely used open-source frameworks such as Hugging Face, vLLM, and DeepSpeed, significantly lowering adoption barriers in real-world services. The sketch below shows how little code that integration typically requires.
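For example, running an off-the-shelf Hugging Face model on Gaudi largely amounts to targeting the "hpu" device. This is a minimal sketch assuming the Habana PyTorch bridge (habana_frameworks) is installed; the model choice is a placeholder.

```python
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any Hugging Face causal LM loads the same way.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to("hpu").eval()

inputs = tok("NPUs in production:", return_tensors="pt").to("hpu")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
htcore.mark_step()  # flush queued ops in Gaudi's lazy execution mode
print(tok.decode(out[0], skip_special_tokens=True))
```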
From Research to Commercial Service: Launching yetter
Leveraging our optimization experience, we created yetter, a generative AI API service powered by Gaudi-2. Built on our optimized inference stack, yetter generates high-quality images at low latency and at substantially lower cost than GPU-based alternatives; in direct benchmarks, it delivered up to a tenfold reduction in per-image generation cost without compromising quality or speed.
yetter reflects our strength in model compression and hardware-level optimization. These capabilities directly translate to lower infrastructure costs, making high-quality generative AI services more accessible.

SqueezeBits at Intel AI Summit 2025
We presented our outcomes at Intel AI Summit 2025 on July 1 in Seoul. Our CEO, Hyungjun Kim, joined the speaker session to share practical insights for teams adopting Intel Gaudi. He walked through real-world examples of compressing and optimizing various generative AI workloads on the platform.
At our booth, we demonstrated yetter running on Gaudi-2 and shared our optimization journey, from compiler-level improvements to deployment strategies.
What we’ve accomplished with Gaudi is just the beginning. We are excited to apply these learnings to other NPU platforms, helping more AI teams harness specialized hardware with real impact. With the right optimizations, we believe every NPU can have its day.

📚 Explore our Intel Gaudi blog series.