SqueezeBits
Guided Decoding Performance on vLLM and SGLang
Your guide to LLM guided decoding. This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output in your use case.
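The core mechanism both backends share is logit masking: at every decoding step, the grammar engine reports which tokens are legal continuations, and all other logits are suppressed before sampling. Below is a minimal, backend-agnostic sketch of that step (not code from the article; the function name and greedy selection are illustrative assumptions):

```python
import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: list[int]) -> int:
    """One guided-decoding step: suppress tokens the grammar forbids,
    then pick from what remains (greedy here, for simplicity)."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return int(np.argmax(masked))

# Example: token 1 has the highest raw logit, but the grammar only
# allows tokens 0 and 2, so token 2 is selected instead.
logits = np.array([1.0, 5.0, 3.0])
print(constrained_next_token(logits, allowed_ids=[0, 2]))
```

What the benchmarked engines differ in is how cheaply they compute `allowed_ids` per step for a full grammar, which is where the vLLM/SGLang performance gaps come from.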
Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode
In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.
Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates inference while keeping accuracy almost unchanged.
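Mechanically, vocabulary trimming just drops unused rows from the input embedding and the output projection, which shrinks both memory and the final logit matmul. A minimal sketch of that row selection (illustrative only; the helper name and the old-to-new ID remapping convention are assumptions, and a real pipeline must also remap the tokenizer accordingly):

```python
import numpy as np

def trim_vocab(emb: np.ndarray, head: np.ndarray, keep_ids):
    """Keep only the rows of the input embedding (emb: [V, d]) and the
    output head (head: [V, d]) for token IDs actually used.
    Returns the trimmed matrices plus an old-ID -> new-ID map."""
    idx = sorted(set(keep_ids))
    remap = {old: new for new, old in enumerate(idx)}
    return emb[idx], head[idx], remap

# Toy 6-token vocab, keeping only 3 tokens.
emb = np.arange(12.0).reshape(6, 2)
head = emb.copy()
small_emb, small_head, remap = trim_vocab(emb, head, keep_ids=[0, 3, 5])
print(small_emb.shape, remap)  # (3, 2) {0: 0, 3: 1, 5: 2}
```

Because the hidden dimension and all transformer blocks are untouched, per-token quality for the kept tokens is preserved; only the coverage of the dropped tokens is sacrificed.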
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses this issue through finer-grained, block-wise updates, significantly enhancing performance and expressivity without extra overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
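The block-wise idea can be sketched as follows: instead of one rank-r adapter pair over the whole weight matrix, partition the matrix into a g x g grid and give each block its own smaller adapter. With per-block rank r/g, the total parameter count matches plain rank-r LoRA. This is a conceptual sketch under those assumptions, not the article's implementation (function name and random initialization are illustrative):

```python
import numpy as np

def blockwise_lowrank_delta(out_dim: int, in_dim: int, g: int, r: int,
                            seed: int = 0) -> np.ndarray:
    """Assemble a block-wise low-rank weight update: a g x g grid of
    independent rank-(r // g) adapter pairs (B_ij @ A_ij per block)."""
    rng = np.random.default_rng(seed)
    bo, bi, rb = out_dim // g, in_dim // g, r // g
    delta = np.zeros((out_dim, in_dim))
    for i in range(g):
        for j in range(g):
            B = rng.standard_normal((bo, rb)) * 0.01  # per-block "B" factor
            A = rng.standard_normal((rb, bi)) * 0.01  # per-block "A" factor
            delta[i * bo:(i + 1) * bo, j * bi:(j + 1) * bi] = B @ A
    return delta

delta = blockwise_lowrank_delta(out_dim=8, in_dim=8, g=2, r=4)
print(delta.shape)  # (8, 8)
```

Parameter-count check: each of the g^2 blocks holds (r/g) * (out_dim + in_dim)/g values, which sums to r * (out_dim + in_dim), the same budget as a single rank-r LoRA, while the updates are localized per block.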