Reliable & Scalable Synthetic Data for Physical AI (Part 2): Making Cosmos 3.1x Faster for Production

Explore why Physical AI deployment needs synthetic data at scale with SqueezeBits' research, and discover how to overcome inference bottlenecks to accelerate RoBoost Agent.
Jongho Lee, Daehyun Ahn, Yeonjoon Jung, Semin Kim, Seungryeol Kim
Mar 11, 2026
Tech · Research
Reliable & Scalable Synthetic Data for Physical AI (Part 1): Taming NVIDIA Cosmos with RoBoost Agent

Scaling Physical AI requires reliable synthetic data. Learn how RoBoost Agent integrates NVIDIA Cosmos to transform world models into trustworthy data engines for robotics and autonomous driving.
Daehyun Ahn, Jongho Lee, Yeonjoon Jung, Semin Kim, Seungryeol Kim
Feb 25, 2026
Research · Tech
Introducing Rebellions' ATOM™-Max

Introducing ATOM™-Max, Rebellions' next-generation NPU designed for high-performance AI inference. Learn how its runtime, profiling tools, and PyTorch-native integrations enable developers to run and serve models efficiently without sacrificing usability.
Huijong Jeong
Dec 24, 2025
Tech
Winning both speed and quality: How Yetter deals with diffusion models

Explore how the Yetter Inference Engine overcomes the limitations of step caching and model distillation for diffusion models. We analyze latency, diversity, quality, and negative-prompt handling to reveal what truly matters for scalable, real-time image generation.
Yeonjoon Jung
Oct 31, 2025
Yetter · Tech
[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
Taesu Kim
Oct 28, 2025
Intel Gaudi
Yetter, the GenAI API service: AI Optimization, Out of the Box

Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimized inference engine, it delivers reliable image, video, and future LLM services at a fraction of the cost.
Seungryeol Kim
Oct 02, 2025
Tech · Yetter
Guided Decoding Performance on vLLM and SGLang

The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.
Eunik Park
Sep 16, 2025
Tech
Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode

In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.
Jiwoong Choi
Aug 26, 2025
Tech
Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration

Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates model inference while keeping accuracy almost unchanged.
Semin Kim
Aug 04, 2025
Tech · Research
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost

LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses these issues through finer-grained, block-wise updates, significantly enhancing performance and expressivity without overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
Yeonjoon Jung
Jul 21, 2025
Research · Tech
OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance

At SqueezeBits, we have been empowering developers to efficiently deploy complex AI models while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network (QNN) through seamless integration with Qualcomm AI Hub.
Eunik Park
Jul 03, 2025
Product · OwLite
Bringing NPUs into Production: Our Journey with Intel Gaudi

SqueezeBits has partnered with Intel to make Gaudi NPUs more usable in practice. We optimized LLMs and diffusion models for Gaudi-2 and created Yetter, a generative AI API service.
Jul 01, 2025
Intel Gaudi · Biz&Insight
How to Quantize Transformer-based Models for TensorRT Deployment

This article describes the experimental results of quantizing Vision Transformer models and their variants with OwLite.
Daehyun Ahn
May 20, 2025
OwLite
How to Quantize YOLO models with OwLite

This article describes the experimental results of quantizing YOLO models with OwLite.
Daehyun Ahn
May 07, 2025
OwLite
OwLite: No More Compromising on AI Performance After Quantization

Discover how OwLite simplifies AI model optimization with seamless integration and secure architecture.
Seungryeol Kim
Apr 11, 2025
Product · OwLite
[Intel Gaudi] #5. FLUX.1 on Gaudi-2

This article discusses inference efficiency when running the FLUX.1 models on Intel Gaudi-2 hardware.
Taesu Kim
Apr 02, 2025
Intel Gaudi · Tech
TensorRT-LLM Goes Open Source!

With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.
Huijong Jeong
Mar 25, 2025
vLLM vs TRT LLM · Tech
When Should I Use Fits on Chips?

This article describes when to use the Fits on Chips toolkit, with specific use cases.
Daehyun Ahn
Mar 10, 2025
Product · Fits on Chips · Tech
Fits on Chips: Saving LLM Costs Became Easier Than Ever

This article introduces Fits on Chips, an LLMOps toolkit for performance evaluation.
Seungryeol Kim
Feb 26, 2025
Product · Fits on Chips
SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

A brief review of the research paper from our team, published at ICML 2024.
Feb 17, 2025
Research · Tech
The Missing Piece of TensorRT-LLM

This article introduces an open-source library for direct conversion of PyTorch models to TensorRT-LLM.
Jiwoong Choi
Feb 10, 2025
Tech · Fits on Chips
The Rise and Fall of ONNX (feat. PyTorch 2.0)

This article explores the rise and fall of ONNX, from its early success as a unifying standard for AI frameworks to its gradual shift into a niche tool in the era of PyTorch 2.0.
Taesu Kim
Feb 06, 2025
Tech
[vLLM vs TensorRT-LLM] #13. Vision-Language Models

This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.
Yeonjoon Jung
Jan 20, 2025
Tech · vLLM vs TRT LLM
[Intel Gaudi] #4. FP8 Quantization

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Minkyu Kim
Jan 13, 2025
Tech · Intel Gaudi
[Intel Gaudi] #3. Performance Evaluation with SynapseAI v1.19

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Taesu Kim
Jan 06, 2025
Tech · Intel Gaudi
[vLLM vs TensorRT-LLM] #12. Automatic Prefix Caching

This article provides a comparative analysis of automatic prefix caching on vLLM and TensorRT-LLM.
Daehyun Ahn, Yeonjoon Jung, Taesu Kim, Huijong Jeong
Dec 23, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #11. Speculative Decoding

This article provides a comparative analysis of speculative decoding on vLLM and TensorRT-LLM.
Daehyun Ahn, Yeonjoon Jung
Dec 09, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #10. Serving Multiple LoRAs at Once

This article provides a comparative analysis of the multi-LoRA serving capabilities of the vLLM and TensorRT-LLM frameworks.
Jongho Lee
Dec 05, 2024
Tech · vLLM vs TRT LLM
[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Taesu Kim
Dec 02, 2024
Tech · Intel Gaudi
[vLLM vs TensorRT-LLM] #9. Parallelism Strategies

This article provides a comparative analysis of different parallelism strategies on the vLLM and TensorRT-LLM frameworks.
Changjun Lee
Nov 26, 2024
Tech · vLLM vs TRT LLM
[Intel Gaudi] #1. Introduction

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Taesu Kim
Nov 21, 2024
Tech · Intel Gaudi
[vLLM vs TensorRT-LLM] #8. KV Cache Quantization

This article provides a comparative analysis of the effects of KV cache quantization on the vLLM and TensorRT-LLM frameworks.
Jiwon Song
Nov 18, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #7. Weight-Activation Quantization

This article provides a comparative analysis of the effects of weight-activation quantization on the vLLM and TensorRT-LLM frameworks.
Eunik Park
Nov 11, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #6. Weight-Only Quantization

This article provides a comparative analysis of the effects of weight-only quantization on the vLLM and TensorRT-LLM frameworks.
Jiwon Song
Nov 01, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #5. Dynamic Sequence Lengths

This article provides a comparative analysis of the vLLM and TensorRT-LLM frameworks, focusing on performance with fixed and dynamic datasets.
Minkyu Kim
Oct 30, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #4. Which Scheduler Wins? 🔥

This article provides a comparative analysis of the schedulers in the vLLM and TensorRT-LLM frameworks.
Huijong Jeong
Oct 24, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #3. Understanding Sampling Methods and Their Performance Impact

This article provides a comparative analysis of the vLLM and TensorRT-LLM frameworks with various sampling methods.
Daehyun Ahn
Oct 18, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving

This article provides a comparative analysis of the vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.
Yeonjoon Jung
Oct 11, 2024
Tech · vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #1. An Overall Evaluation

This article provides a comparative analysis of the vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners optimizing LLM deployment strategies.
Yeonjoon Jung
Oct 01, 2024
Tech · vLLM vs TRT LLM
How much can we save through compression?

Estimating the cost savings from model compression.
Jun 26, 2024
Tech
‘Breaking Down’ Tokenizers in LLMs

An introduction to tokenizers and their implications in language models.
May 16, 2024
Tech
Accuracy Degradation in AI Compression: Myth or Truth?

Clarifying the misunderstandings in AI model compression.
Apr 24, 2024
Tech
Are you getting everything out of your GPUs?

The Blackwell GPU unveiled at GTC 2024 was astonishing. An analysis of NVIDIA's GPU evolution and what it means for GPU users.
Apr 23, 2024
Tech
Things to check if your business utilizes AI

Do I need to COMPRESS my AI model? The short answer is “YES”, and here’s why.
Apr 19, 2024
Tech
AI Compression for Acceleration: 4 Key Methods.

AI model compression for acceleration is essential. The question is HOW? Here are 4 key methodologies.
Apr 15, 2024
Tech

The official SqueezeBits Tech blog
