SqueezeBits
Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming large multilingual vocabularies in Small Language Models (SLMs) is a simple, low-risk way to boost efficiency. It significantly accelerates inference while leaving accuracy almost unchanged.
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but degrades at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses this through finer-grained, block-wise updates, significantly improving expressivity and performance without added cost. GraLoRA outperforms LoRA across tasks, achieving up to a +8.5% improvement in HumanEval+ Pass@1.
[Intel Gaudi] #5. FLUX.1 on Gaudi-2
This article examines inference efficiency when running FLUX.1 models on Intel Gaudi-2 hardware.
TensorRT-LLM Goes Open Source!
With TensorRT-LLM now open source, we can finally take a deep dive into the secret sauce behind its impressive performance.