SqueezeBits
Guided Decoding Performance on vLLM and SGLang
Your guide to LLM guided decoding. This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output in your use case.
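The core mechanism both backends share is logit masking: at every decoding step, the grammar engine reports which tokens are legal continuations, and all other logits are suppressed before sampling. Below is a minimal, backend-agnostic sketch of that step (not code from the article; the function name and greedy selection are illustrative assumptions):

```python
import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: list[int]) -> int:
    """One guided-decoding step: suppress tokens the grammar forbids,
    then pick from what remains (greedy here, for simplicity)."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return int(np.argmax(masked))

# Example: token 1 has the highest raw logit, but the grammar only
# allows tokens 0 and 2, so token 2 is selected instead.
logits = np.array([1.0, 5.0, 3.0])
print(constrained_next_token(logits, allowed_ids=[0, 2]))
```

What the benchmarked engines differ in is how cheaply they compute `allowed_ids` per step for a full grammar, which is where the vLLM/SGLang performance gaps come from.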
Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode
In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.
Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates inference while keeping accuracy almost unchanged.
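Mechanically, vocabulary trimming just drops unused rows from the input embedding and the output projection, which shrinks both memory and the final logit matmul. A minimal sketch of that row selection (illustrative only; the helper name and the old-to-new ID remapping convention are assumptions, and a real pipeline must also remap the tokenizer accordingly):

```python
import numpy as np

def trim_vocab(emb: np.ndarray, head: np.ndarray, keep_ids):
    """Keep only the rows of the input embedding (emb: [V, d]) and the
    output head (head: [V, d]) for token IDs actually used.
    Returns the trimmed matrices plus an old-ID -> new-ID map."""
    idx = sorted(set(keep_ids))
    remap = {old: new for new, old in enumerate(idx)}
    return emb[idx], head[idx], remap

# Toy 6-token vocab, keeping only 3 tokens.
emb = np.arange(12.0).reshape(6, 2)
head = emb.copy()
small_emb, small_head, remap = trim_vocab(emb, head, keep_ids=[0, 3, 5])
print(small_emb.shape, remap)  # (3, 2) {0: 0, 3: 1, 5: 2}
```

Because the hidden dimension and all transformer blocks are untouched, per-token quality for the kept tokens is preserved; only the coverage of the dropped tokens is sacrificed.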
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses this issue through finer-grained, block-wise updates, significantly enhancing performance and expressivity without extra overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
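The block-wise idea can be sketched as follows: instead of one rank-r adapter pair over the whole weight matrix, partition the matrix into a g x g grid and give each block its own smaller adapter. With per-block rank r/g, the total parameter count matches plain rank-r LoRA. This is a conceptual sketch under those assumptions, not the article's implementation (function name and random initialization are illustrative):

```python
import numpy as np

def blockwise_lowrank_delta(out_dim: int, in_dim: int, g: int, r: int,
                            seed: int = 0) -> np.ndarray:
    """Assemble a block-wise low-rank weight update: a g x g grid of
    independent rank-(r // g) adapter pairs (B_ij @ A_ij per block)."""
    rng = np.random.default_rng(seed)
    bo, bi, rb = out_dim // g, in_dim // g, r // g
    delta = np.zeros((out_dim, in_dim))
    for i in range(g):
        for j in range(g):
            B = rng.standard_normal((bo, rb)) * 0.01  # per-block "B" factor
            A = rng.standard_normal((rb, bi)) * 0.01  # per-block "A" factor
            delta[i * bo:(i + 1) * bo, j * bi:(j + 1) * bi] = B @ A
    return delta

delta = blockwise_lowrank_delta(out_dim=8, in_dim=8, g=2, r=4)
print(delta.shape)  # (8, 8)
```

Parameter-count check: each of the g^2 blocks holds (r/g) * (out_dim + in_dim)/g values, which sums to r * (out_dim + in_dim), the same budget as a single rank-r LoRA, while the updates are localized per block.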