[vLLM vs TensorRT-LLM] #6. Weight-Only Quantization
This article provides a comparative analysis of the effects of weight-only quantization on vLLM and TensorRT-LLM frameworks.
Nov 01, 2024
Introduction
Quantization is a widely used technique for compressing deep learning models and accelerating inference. It is especially valuable for LLMs due to their enormous number of parameters and computational demands. Both vLLM and TensorRT-LLM support diverse quantization methods to give users practical options for faster LLM serving.
In this and the next two posts, we will explore the quantization techniques available in vLLM and TensorRT-LLM. This article covers weight-only quantization; the next post will explore weight-activation quantization, and the last post will cover KV cache quantization in long-context scenarios.
The effectiveness of quantization can vary significantly with many factors: model architecture, model size, hardware, model parallelism, and so on. In this article, we use a relatively small Llama-3.1 variant on a single GPU for ease of analysis.
Weight-Only Quantization
Weight-only quantization reduces the memory footprint of model weights by converting them from high-precision floating-point formats (e.g., 32-bit or 16-bit) to lower precision, such as 8-bit or 4-bit. The 4-bit format has gained popularity recently as it effectively reduces memory usage while preserving accuracy.
In weight-only quantization, only the weights are quantized, leaving activation tensors in high precision. Since GPUs lack arithmetic units for multiplying tensors of mixed precision, quantized weights are dequantized to high precision during inference, introducing extra computation for dequantization (see Figure 1). Thus, the effectiveness of weight-only quantization depends on whether the workload is memory-bound or compute-bound: the dequantization overhead is more prominent when the workload is compute-bound.
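To make the mechanism concrete, the sketch below is a toy illustration of our own, not the kernels used by either framework. It quantizes a weight matrix to INT4 values with group-wise scales and dequantizes it back to the activation precision right before a standard GEMM; real kernels pack two 4-bit values per byte and fuse the dequantization into the matmul.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Toy group-wise symmetric INT4 quantization of a [out, in] weight matrix.
    Real kernels pack two 4-bit values per byte; here we keep them in int8 for clarity."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0  # INT4 range [-8, 7]
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def weight_only_linear(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor):
    """Dequantize the weights back to the activation dtype, then run a normal GEMM.
    The dequantization is the extra per-forward-pass work discussed above."""
    w = (q.to(x.dtype) * scales.to(x.dtype)).reshape(q.shape[0], -1)
    return x @ w.t()

# Toy usage (FP32 here so it also runs on CPU; in practice activations are FP16/BF16 on GPU).
w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
q, s = quantize_int4_groupwise(w)
y = weight_only_linear(x, q, s)
print(y.shape)
```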
Serving LLMs consists of two phases: the prefill phase and the decode phase. The prefill phase, which processes long input prompts at once, is typically compute-bound. In contrast, the decode phase, handling one token at a time, is memory-bound with small batch sizes and shifts towards compute-bound as batch size increases. Given these varying workload characteristics even within a single inference process, thorough testing is essential to accurately assess the true impact of weight-only quantization on LLM performance.
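As a rough back-of-the-envelope illustration of this transition (our own simplification, with assumed hardware numbers rather than measured ones), consider a single d x d linear layer in the decode phase: each step performs about 2*B*d^2 FLOPs while reading about d^2 * bytes_per_weight bytes of weights, so the arithmetic intensity is roughly 2*B / bytes_per_weight FLOP per byte, independent of d.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for a d x d linear layer during decode.
    FLOPs ~= 2 * B * d^2, weight bytes ~= d^2 * bytes_per_weight, so d cancels out."""
    return 2 * batch_size / bytes_per_weight

# Assumed H100-class ridge point: very roughly ~1e15 FLOPS FP16 / ~2e12 B/s ~= 500 FLOP/byte.
RIDGE_FLOP_PER_BYTE = 500

for batch in (1, 16, 64, 256):
    fp16 = decode_arithmetic_intensity(batch, bytes_per_weight=2.0)   # FP16 weights
    int4 = decode_arithmetic_intensity(batch, bytes_per_weight=0.5)   # INT4 weights
    bound = "memory-bound" if fp16 < RIDGE_FLOP_PER_BYTE else "compute-bound"
    print(f"B={batch:>3}: FP16 {fp16:6.1f} FLOP/B ({bound}), INT4 {int4:6.1f} FLOP/B")
```

At small batch sizes the intensity sits far below the ridge point, so shrinking weight bytes translates almost directly into speedup; as the batch grows, the workload approaches the compute-bound regime and the benefit shrinks.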
Quantization Options in vLLM and TensorRT-LLM
There are several weight-only quantization methods for LLMs, with AWQ and GPTQ among the most prominent. Since many weight-only quantization schemes reduce weight bit-width to INT4 or even lower, dedicated computation kernels are necessary to accelerate their computation. Both vLLM and TensorRT-LLM offer multiple kernel options to support and accelerate weight-only quantization schemes.
In vLLM, users can use the official AWQ kernel for AWQ and the ExLlamaV2 kernel for GPTQ as the default options to accelerate weight-only quantized LLMs. Additional kernel options, optimized especially for larger batch sizes, include Marlin and Machete. The Marlin kernel is designed for high performance in batched settings and is available for both AWQ and GPTQ in vLLM. Machete, a mixed-precision linear kernel from NeuralMagic, is similar to Marlin in concept but optimized specifically for the Hopper architecture. It claims improved performance over Marlin in highly batched operations and currently supports only GPTQ models, not AWQ models. In contrast, TensorRT-LLM offers only two options: its own implementations of AWQ and GPTQ.
Loading weight-only quantized LLMs poses its own challenges, as sub-byte weights require dedicated formats for saving and loading. vLLM supports directly loading models quantized with the AutoAWQ or AutoGPTQ libraries, allowing users to select from the kernels above to accelerate the loaded model. vLLM also supports quantizing LLMs through the LLM-Compressor suite; in that case, the GPTQ scheme is used for weight-only quantization, and both the Marlin and Machete kernels are available.
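As an illustration, loading a pre-quantized checkpoint in vLLM and steering the kernel choice looks roughly like the minimal sketch below. The model path is a placeholder, and the exact `quantization` option names (e.g., "awq", "awq_marlin", "gptq_marlin") can vary between vLLM versions, so check the documentation of the version you use.

```python
from vllm import LLM, SamplingParams

# Hypothetical path to a checkpoint produced by AutoAWQ / AutoGPTQ / LLM-Compressor.
# vLLM usually infers the quantization method from the checkpoint config; setting
# `quantization` explicitly lets you pick a kernel path (e.g., "awq" vs. "awq_marlin").
llm = LLM(
    model="path/to/llama-3.1-8b-instruct-awq-int4",
    quantization="awq_marlin",
)

outputs = llm.generate(
    ["Explain weight-only quantization in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```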
In contrast, TensorRT-LLM requires additional steps. Models quantized with the AutoGPTQ library must first be converted to a TensorRT-LLM-compatible format. For AWQ, users need to rely on TensorRT-LLM’s quantization suite, Model Optimizer, to perform the quantization and subsequently accelerate the quantized model within TensorRT-LLM.
With a wide range of quantization schemes and kernels available, choosing the best option for a specific service scenario can be challenging. To provide insights into which option may be optimal, we will now dive into a performance comparison of each choice across various parameters, including sequence length and maximum batch size.
Experiment Setup
Quantization Schemes & Kernels
For all experiments involving AWQ and GPTQ, weights are quantized to the 4-bit (INT4) format and the group size for scaling factors is set to 128. For vLLM, multiple combinations of quantization schemes and kernels are possible. In the AWQ setup, we used models quantized with the AutoAWQ library and tested two kernels: the official AWQ kernel and Marlin. For the GPTQ scheme, the AutoGPTQ library was used with the ExLlamaV2 kernel. Models for the Marlin and Machete kernels, on the other hand, were prepared using the LLM-Compressor suite for better performance. For TensorRT-LLM, only two options are available: AWQ and GPTQ, both using NVIDIA's kernels. An AutoAWQ quantization sketch is shown after the list below.
- vLLM: AWQ (Official/Marlin), GPTQ (ExLlamaV2/Marlin/Machete) — total 5 options
- TensorRT-LLM: AWQ, GPTQ — total 2 options
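For reference, the sketch below shows roughly how an AWQ-INT4 checkpoint (4-bit weights, group size 128) can be produced with the AutoAWQ library, following its documented usage; the output directory name is our own placeholder and argument details may differ slightly across AutoAWQ versions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "llama-3.1-8b-instruct-awq-int4"   # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize the weights to INT4 with group size 128.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer so they can be loaded by vLLM.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```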
Benchmark Dataset
For all experiments, we used datasets with fixed input and output lengths to maintain consistency in the number of processed tokens across both frameworks.
Two distinct datasets were used: prefill-heavy and decode-heavy. The prefill-heavy dataset has input and output lengths of 2,048 and 128 tokens, respectively, while the decode-heavy dataset has input and output lengths of 128 and 2,048 tokens.
We evaluated these datasets with varying max batch sizes. The number of tested requests was 256, except for the max batch size of 256, where we evaluated with 1,024 samples for stable measurements.
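One simple way to build such fixed-length requests (our own sketch, not the benchmark harness used for the measurements) is to construct prompts with an exact number of tokens and pin the output length by ignoring the EOS token:

```python
from transformers import AutoTokenizer
from vllm import SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

def fixed_length_dataset(num_requests: int, input_len: int, output_len: int):
    """Return (prompt_token_ids, sampling_params) pairs with fixed input/output lengths."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    # An arbitrary filler token repeated to the exact prompt length.
    filler_id = tokenizer.encode("hello", add_special_tokens=False)[0]
    prompt_ids = [filler_id] * input_len
    # ignore_eos plus min/max tokens pin the output length to exactly `output_len`.
    params = SamplingParams(max_tokens=output_len, min_tokens=output_len, ignore_eos=True)
    return [(list(prompt_ids), params) for _ in range(num_requests)]

prefill_heavy = fixed_length_dataset(256, input_len=2048, output_len=128)
decode_heavy = fixed_length_dataset(256, input_len=128, output_len=2048)
```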
Framework Version
We selected recent versions of both frameworks that successfully completed the benchmarking process.
- vLLM: v0.6.2 (commit 7193774)
- TensorRT-LLM: 0.13.0 release with C++ API
Model and Hardware
- Model: Llama-3.1-8B-Instruct (FP16, AWQ-INT4, GPTQ-INT4)
- H/W: NVIDIA H100-PCIe 80G GPU, Intel Xeon(R) Platinum 8352Y CPU (32 Cores) @ 2.20GHz
Results
In this article, we focus on the throughput comparison between vLLM and TensorRT-LLM across their different options. We measured throughput while varying the maximum batch size to assess the effect of weight-only quantization in both memory-bound and compute-bound scenarios. We set the request rate to infinity to determine the maximum achievable throughput for each kernel.
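For intuition, a simplified offline version of this "request rate = infinity" measurement looks like the sketch below: all requests are submitted at once, the max batch size is capped via an engine argument, and throughput is computed as generated tokens divided by wall-clock time. The checkpoint path is a placeholder, and the numbers in this post come from each framework's own serving benchmark rather than this snippet.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; max_num_seqs caps the maximum batch size under test.
llm = LLM(model="path/to/llama-3.1-8b-instruct-gptq-int4", max_num_seqs=64)

# Decode-heavy-style requests: short prompts, 2,048 forced output tokens each.
prompts = ["hello " * 128] * 256
params = SamplingParams(max_tokens=2048, min_tokens=2048, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Output token throughput: {generated / elapsed:.1f} tok/s")
```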
FP16 vs. WOQ Best
Figure 2 shows the throughput improvement by weight-only quantization when max batch size is small. Weight-only quantization resulted in approximately 2x increase in throughput for both vLLM and TensorRT-LLM, regardless of input and output lengths. With smaller batch sizes, where the workload remains memory-bound, weight-only quantization has a significant impact by reducing the amount of data read from memory.
As batch size increases, LLM inference becomes more compute-bound, reducing the throughput gains from weight-only quantization. Still, weight-only quantization maintained throughput comparable to the FP16 baseline even at the largest tested batch size. In decode-heavy scenarios with longer output lengths, weight-only quantization surprisingly provided a throughput boost again despite the compute-bound nature of the task. This was primarily due to the increased active batch size enabled by the reduced weight memory size, as shown in Figure 4.
Figure 4 shows the actual batch size during inference with the TensorRT-LLM framework. During FP16 inference, the effective batch size was consistently lower than the max batch size of 256. In contrast, weight-only quantization achieved a higher effective batch size, thanks to the reduced model size. This better batch utilization resulted in a net throughput gain, as the benefits from batch utilization outweighed the throughput loss caused by dequantization overhead in the compute-bound scenario.
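A rough memory budget calculation helps explain this effect. The estimate below is our own, using the standard published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128) and ignoring activations and framework overhead: shrinking the weights from FP16 to INT4 frees on the order of 11-12 GB, which corresponds to roughly forty additional 2K-token sequences worth of KV cache.

```python
GIB = 1024 ** 3

# Assumed Llama-3.1-8B-Instruct configuration (standard published values).
params = 8.03e9
layers, kv_heads, head_dim = 32, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V in FP16 -> 128 KiB/token

weights_fp16 = params * 2.0            # 16-bit weights
weights_int4 = params * 0.5 * 1.1      # 4-bit weights plus ~10% overhead for group scales (rough)

seq_len = 128 + 2048                   # decode-heavy request: 128 in + 2,048 out
kv_per_seq = kv_bytes_per_token * seq_len

freed = weights_fp16 - weights_int4
print(f"KV cache per sequence : {kv_per_seq / GIB:.2f} GiB")
print(f"Weight memory freed   : {freed / GIB:.1f} GiB "
      f"(~{freed / kv_per_seq:.0f} extra concurrent sequences)")
```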
vLLM vs. TensorRT-LLM
Now, we compare throughput of TensorRT-LLM and vLLM with their best weight-only quantization configurations.
Figure 5 shows the overall throughput comparison of vLLM and TensorRT-LLM when weight-only quantization is used. In general, TensorRT-LLM showed higher throughput than vLLM for most max batch sizes. When the max batch size is 256, TensorRT-LLM achieved 1.18x and 1.15x higher throughput than vLLM on the prefill-heavy and decode-heavy workloads, respectively. Meanwhile, vLLM outperformed TensorRT-LLM when the max batch size was 4 and 16 on the decode-heavy dataset.
Additionally, Figure 5 includes a table showing which kernel yields the best throughput in each case. Interestingly, the optimal kernel varies by max batch size. For TensorRT-LLM, with only two kernel options, AWQ performed best at a batch size of 1, while GPTQ was superior in all other cases. For vLLM, however, the ideal kernel option changed with batch size. In the decode-heavy scenario, ExLlamaV2 was best at a small batch size (1), Marlin excelled at mid-range batch sizes (4-16), and Machete showed the highest throughput at large batch sizes (64-256).
ExLlamaV2 vs. Marlin vs. Machete
Observing that different kernels in vLLM exhibit varying performance across batch sizes, we dive further into a comparative analysis of these kernels. As both the Marlin and Machete kernels claim to be optimized for large-batch settings, we compared the ExLlamaV2, Marlin, and Machete kernels across different batch sizes. Since the Machete kernel does not support AWQ, we compared vLLM's kernel implementations on GPTQ models.
Figure 6 shows the throughput comparison when the max batch size is relatively small. When the batch size was 1, the default ExLlamaV2 kernel achieved higher throughput than the Marlin or Machete kernels. When the max batch size increased to 4 or 16, the Marlin and Machete kernels started to outperform the ExLlamaV2 kernel. In particular, when the max batch size was 16, the Marlin and Machete kernels showed almost 2x higher throughput than the ExLlamaV2 kernel. Meanwhile, the Marlin kernel outperformed the Machete kernel and even TensorRT-LLM's best case in this range.
At larger batch sizes, Machete takes the lead. This is consistent with the claim from the Machete kernel's author that Machete is a better successor to Marlin at larger batch sizes. Nevertheless, TensorRT-LLM is the best option in these large-batch cases.
Final Thoughts
Our evaluation compared the weight-only quantization scheme and kernel options provided in vLLM and TensorRT-LLM. Overall, TensorRT-LLM showed better performance in most cases, but it is noteworthy that vLLM outperformed TensorRT-LLM in certain cases. In addition, even with the same quantization scheme, choosing the optimal kernel can lead to significant performance improvements in vLLM. Although the performance gain from weight-only quantization gradually declined as the max batch size increased, we found that better batch utilization can help improve throughput even at large batch sizes.
It is important to note that our evaluations have some limitations. First, our experiments focused on the Llama-3.1-8B-Instruct model on a single H100 PCIe GPU, and the takeaways might not translate to other models and environments. Using different models (especially larger ones) or model parallelism across multiple GPUs may reveal different tendencies in the results. Second, there are many other techniques provided by both frameworks that can be applied orthogonally to quantization, such as chunked prefill. Therefore, it is crucial to conduct experiments like those shown in this article, tailored to your own model, service scenario, and hardware environment.
In the following post, weight-activation quantization will be covered. By utilizing low-precision Tensor Cores in GPUs for matrix multiplication, weight-activation quantization can reach even higher throughput at large batch sizes. Stay tuned for more insights in the vLLM vs TensorRT-LLM series!