SqueezeBits
GraLoRA: Boosting Fine-Tuning Accuracy Without Extra Cost
LoRA excels at efficient fine-tuning but suffers at higher ranks due to gradient entanglement. We introduce GraLoRA, which addresses this issue through finer-grained, block-wise updates, significantly enhancing expressivity and performance without extra overhead. GraLoRA outperforms LoRA across tasks, achieving up to +8.5% improvement in HumanEval+ Pass@1.
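To make the block-wise idea concrete, here is a minimal sketch of a low-rank adapter that splits the frozen weight into a grid of sub-blocks, each with its own factor pair, so that the gradient of one block does not mix with the others. This is an illustrative reconstruction, not the official GraLoRA implementation: the class name `BlockWiseLoRALinear`, the `k`, `rank`, and `alpha` parameters, and the scaling convention are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class BlockWiseLoRALinear(nn.Module):
    """Illustrative block-wise low-rank adapter (not the official GraLoRA code).

    The frozen weight is conceptually partitioned into a k x k grid of
    sub-blocks, and each sub-block receives its own low-rank update
    B_ij @ A_ij, so updates to one block stay independent of the others.
    """

    def __init__(self, base: nn.Linear, rank: int = 16, k: int = 2, alpha: float = 32.0):
        super().__init__()
        assert base.in_features % k == 0 and base.out_features % k == 0
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen

        self.k = k
        self.block_in = base.in_features // k
        self.block_out = base.out_features // k
        self.scale = alpha / rank
        # One (A, B) pair per sub-block; B starts at zero so the initial update is zero.
        self.A = nn.Parameter(torch.randn(k, k, rank, self.block_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(k, k, self.block_out, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        # Split the input features into k column groups.
        x_blocks = x.split(self.block_in, dim=-1)
        row_outputs = []
        for i in range(self.k):          # output (row) block index
            delta = 0.0
            for j in range(self.k):      # input (column) block index
                delta = delta + x_blocks[j] @ self.A[i, j].T @ self.B[i, j].T
            row_outputs.append(delta)
        # Concatenate the per-row-block updates and add them to the frozen output.
        return out + self.scale * torch.cat(row_outputs, dim=-1)


# Example: wrap a projection layer so only the block-wise adapters are trainable.
layer = BlockWiseLoRALinear(nn.Linear(1024, 1024), rank=16, k=2)
y = layer(torch.randn(4, 1024))
```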
OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance
At SqueezeBits, we have been empowering developers to deploy complex AI models efficiently while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network (QNN) through seamless integration with Qualcomm AI Hub.
Bringing NPUs into Production: Our Journey with Intel Gaudi
SqueezeBits has partnered with Intel to make Gaudi NPUs more usable in practice. We optimized LLMs and diffusion models for Gaudi-2 and created yetter, a generative AI API service.
How to Quantize Transformer-based Models for TensorRT Deployment
This article presents experimental results from quantizing the Vision Transformer model and its variants with OwLite.