SqueezeBits
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses this issue through finer-grained, block-wise updates, significantly enhancing expressivity and performance without extra overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
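To make the block-wise idea concrete, here is a minimal sketch of a low-rank adapter that splits the frozen weight into a grid of sub-blocks, each with its own factor pair, so that the gradient of one block does not mix with the others. This is an illustrative reconstruction, not the official GraLoRA implementation: the class name `BlockWiseLoRALinear`, the `k`, `rank`, and `alpha` parameters, and the scaling convention are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class BlockWiseLoRALinear(nn.Module):
    """Illustrative block-wise low-rank adapter (not the official GraLoRA code).

    The frozen weight is conceptually partitioned into a k x k grid of
    sub-blocks, and each sub-block receives its own low-rank update
    B_ij @ A_ij, so updates to one block stay independent of the others.
    """

    def __init__(self, base: nn.Linear, rank: int = 16, k: int = 2, alpha: float = 32.0):
        super().__init__()
        assert base.in_features % k == 0 and base.out_features % k == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen

        self.k = k
        self.block_in = base.in_features // k
        self.block_out = base.out_features // k
        self.scale = alpha / rank
        # One (A, B) pair per sub-block; B starts at zero so the initial update is zero.
        self.A = nn.Parameter(torch.randn(k, k, rank, self.block_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(k, k, self.block_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        # Split the input features into k column groups.
        x_blocks = x.split(self.block_in, dim=-1)
        row_outputs = []
        for i in range(self.k):          # output (row) block index
            delta = 0.0
            for j in range(self.k):      # input (column) block index
                delta = delta + x_blocks[j] @ self.A[i, j].T @ self.B[i, j].T
            row_outputs.append(delta)
        # Concatenate the per-row-block updates and add them to the frozen output.
        return out + self.scale * torch.cat(row_outputs, dim=-1)


# Example: wrap a projection layer so only the block-wise adapters are trainable.
layer = BlockWiseLoRALinear(nn.Linear(1024, 1024), rank=16, k=2)
y = layer(torch.randn(4, 1024))
```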
OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance
At SqueezeBits, we have been empowering developers to deploy complex AI models efficiently while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network (QNN) through seamless integration with Qualcomm AI Hub.
Bringing NPUs into Production: Our Journey with Intel Gaudi
SqueezeBits has partnered with Intel to make Gaudi NPUs more usable in practice. We optimized LLMs and diffusion models for Gaudi-2 and created yetter, a generative AI API service.
How to Quantize Transformer-based Models for TensorRT Deployment
This article presents experimental results from quantizing the Vision Transformer model and its variants with OwLite.