Fits on Chips
SqueezeBits
Yeonjoon Jung
[vLLM vs TensorRT-LLM] #13. Vision-Language Models
This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.
Jan 20, 2025
Tech
vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #12. Automatic Prefix Caching
This article provides a comparative analysis of automatic prefix caching in vLLM and TensorRT-LLM.
Dec 23, 2024
Tech
vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #11. Speculative Decoding
This article provides a comparative analysis of speculative decoding in vLLM and TensorRT-LLM.
Dec 09, 2024
Tech
vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.
Oct 11, 2024
Tech
vLLM vs TRT LLM
[vLLM vs TensorRT-LLM] #1. An Overall Evaluation
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM deployment strategies.
Oct 01, 2024
Tech
vLLM vs TRT LLM