[vLLM vs TensorRT-LLM] #1. An Overall Evaluation
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks for serving LLMs, evaluating their performance based on key metrics like throughput, TTFT, and TPOT to offer insights for practitioners in optimizing LLM deployment strategies.
Oct 01, 2024
Contents
- Introduction
- Preliminaries
- Understanding Key Metrics in LLM Serving
- Experiment Setup
- Benchmark Dataset
- Framework Version
- Model and Hardware
- Performance with default configuration
- Results
- Scenario #1: TPOT Constrained Scenario
- TPOT and Batch Size
- Results
- Scenario #2: TTFT Constrained Scenario
- TTFT and Request Rate
- Results
- Final Thoughts

Introduction
vLLM and TensorRT-LLM are two leading frameworks for efficiently serving Large Language Models (LLMs). vLLM is a fast, user-friendly library that supports LLM inference and serving across multiple devices, including NVIDIA, AMD, and Intel GPUs. In contrast, TensorRT-LLM is a highly optimized toolbox designed to accelerate inference performance exclusively on NVIDIA GPUs. Both frameworks are designed to maximize inference speed and resource utilization while minimizing latency.
This article provides an intuitive comparison of vLLM and TensorRT-LLM. To ensure a fair evaluation, we selected a commonly used LLM model and an industry-standard NVIDIA GPU: Llama-3-8B and the A100-SXM 80G GPU. We evaluated both frameworks using their default settings and then explored more optimal configurations under specific real-world scenarios. Our goal is to offer valuable insights for practitioners looking for the most suitable solution for their LLM deployment strategies.
Preliminaries
Understanding Key Metrics in LLM Serving
Evaluating the performance of LLMs requires an understanding of three key metrics: Throughput, Time-to-First-Token (TTFT), and Time-Per-Output-Token (TPOT). Each metric and related parameters are shown in Figure 1.
Throughput (Tokens/s)
- Throughput refers to the number of tokens the system can generate in a unit of time. It is calculated as the total number of generated tokens divided by the total inference time. High throughput indicates that the system can efficiently handle a large volume of requests, which is crucial for real-time applications and serving many users simultaneously.
Time-to-First-Token (TTFT, s)
- TTFT measures the latency between receiving a request and generating the request’s first token. This metric is critical for user experience, particularly in interactive applications where immediate feedback is expected. A lower TTFT means a quicker initial response and a more responsive application.
Time-Per-Output-Token (TPOT, ms)
- TPOT, also known as inter-token latency (ITL), is the average time to generate each subsequent token after the first token. This metric provides insight into the model’s token generation speed during inference. A lower TPOT results in faster, smoother token generation.
By monitoring and optimizing throughput, TTFT, and TPOT, practitioners can make informed decisions about model deployment, resource allocation, and system configurations. Thus, we focus on these performance indicators during the comparison between vLLM and TensorRT-LLM.
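To make these definitions concrete, the sketch below computes the three metrics from per-request timing records. It is an illustrative implementation only, not the exact logic used by either framework’s benchmark script, and the field names are assumptions made for this example.

```python
# Minimal sketch: computing throughput, TTFT, and TPOT from per-request
# timing records. Field names are illustrative, not tied to either framework.
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    send_time: float          # when the request was issued (s)
    first_token_time: float   # when the first output token arrived (s)
    finish_time: float        # when generation completed (s)
    num_output_tokens: int    # tokens generated for this request


def summarize(timings: List[RequestTiming]) -> dict:
    total_tokens = sum(t.num_output_tokens for t in timings)
    wall_time = max(t.finish_time for t in timings) - min(t.send_time for t in timings)

    # Throughput: total generated tokens divided by total inference time.
    throughput = total_tokens / wall_time

    # TTFT: latency from sending a request to receiving its first token.
    ttfts = [t.first_token_time - t.send_time for t in timings]

    # TPOT: average time per output token after the first one.
    tpots = [
        (t.finish_time - t.first_token_time) / (t.num_output_tokens - 1)
        for t in timings
        if t.num_output_tokens > 1
    ]

    return {
        "throughput_tok_s": throughput,
        "mean_ttft_s": sum(ttfts) / len(ttfts),
        "mean_tpot_ms": 1000 * sum(tpots) / len(tpots),
    }
```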
Experiment Setup
Benchmark Dataset
For all experiments, we used datasets with fixed input and output lengths to ensure consistency in the number of processed tokens across both frameworks. Both vLLM and TensorRT-LLM support the creation of fixed-length datasets composed of random tokens. In vLLM, the desired input and output lengths are provided directly to the benchmark_serving.py script, while in TensorRT-LLM, the dataset is generated using a separate prepare_dataset.py script; a sketch of the underlying idea is shown below.
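As an illustration of the idea only (not the actual logic of either script), the snippet below builds fixed-length prompts of random tokens with the Hugging Face tokenizer for Llama-3-8B. The model ID and output format are assumptions for this sketch.

```python
# Illustrative sketch: building fixed-length random-token prompts, similar in
# spirit to what benchmark_serving.py (vLLM) and prepare_dataset.py
# (TensorRT-LLM) do internally. Exact sampling details differ per framework.
import random
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # model used in this article


def make_random_dataset(num_samples: int, input_len: int, output_len: int):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    dataset = []
    for _ in range(num_samples):
        # Sample random token ids and decode them into a prompt.
        # Note: re-tokenizing the decoded text may not yield exactly
        # input_len tokens; the real benchmark scripts handle this carefully.
        token_ids = [random.randrange(tokenizer.vocab_size) for _ in range(input_len)]
        prompt = tokenizer.decode(token_ids, skip_special_tokens=True)
        dataset.append({"prompt": prompt, "max_tokens": output_len})
    return dataset


# e.g. the (2048, 128) workload used throughout this article
samples = make_random_dataset(num_samples=4096, input_len=2048, output_len=128)
```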
Framework Version
We selected the most recent versions of both frameworks that successfully completed the benchmarking process. For vLLM, we used v0.6.1 (commit 530821d0), and for TensorRT-LLM, we used 0.14.0dev2024091000 with the C++ API.
Model and Hardware
- Model: Llama-3-8B (BF16)
- H/W: NVIDIA A100-SXM 80G GPU
Performance with default configuration
- Workload: Four random datasets, each containing 4096 samples with fixed input and output lengths: (128, 128), (2048, 128), (128, 2048), (2048, 2048)
We evaluated the default settings of vLLM and TensorRT-LLM using datasets with varying input and output length combinations. To prevent memory-related errors, the maximum sequence length was set to the sum of the input and output lengths for each dataset. All other settings were kept at their defaults.
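For illustration, a minimal vLLM setup under these assumptions might look like the sketch below. The actual benchmarks were run through each framework’s serving benchmark scripts rather than this offline API, and only the arguments discussed in this article are shown.

```python
# Minimal sketch (vLLM offline API) of the default-configuration setup for the
# (2048, 128) workload: max sequence length = input length + output length.
from vllm import LLM, SamplingParams

INPUT_LEN, OUTPUT_LEN = 2048, 128

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    dtype="bfloat16",                      # BF16, as in the experiments
    max_model_len=INPUT_LEN + OUTPUT_LEN,  # prevents memory-related errors
)

params = SamplingParams(max_tokens=OUTPUT_LEN, ignore_eos=True)
# outputs = llm.generate(prompts, params)
```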
Results
As shown in Figure 2, TensorRT-LLM demonstrated superior performance across all metrics compared to vLLM under the default configurations. Specifically, on the dataset with short input and output lengths, TensorRT-LLM showed 1.34x higher throughput than vLLM. On the dataset with long input and output lengths, TensorRT-LLM excelled in TPOT, delivering a 2.72x gain over vLLM.
However, the two frameworks showed only marginal differences across all metrics on the (2048, 128) dataset. Therefore, the following sections focus on this dataset to explore in detail how different configurations affect performance in real-world scenarios.
Lastly, both frameworks exhibited extremely high TTFT compared to practical use cases because the default request rate was set to infinity. The relationship between request rate and TTFT will be covered in the following sections (Figure 5).
Scenario #1: TPOT Constrained Scenario
In the previous section, we compared vLLM and TensorRT-LLM under their default configurations. However, many real-world applications come with specific service requirements. In such cases, the default configuration may not be sufficient, and additional optimization is needed to meet those requirements.
In this scenario, TPOT is the critical constraint. TPOT is a metric closely tied to user experience, and optimizing for fast TPOT is often a priority in LLM services (see the 800-tokens-per-second demo from Groq). The setup for Scenario #1 is as follows:
- Workload: random dataset containing 4096 samples with fixed input and output length (2048, 128)
- Requirements: TPOT must be smaller than 20ms
- Goal: Maximum Throughput
The default configuration of both frameworks cannot meet the strict TPOT constraint in Scenario #1, requiring adjustments to the default settings.
TPOT and Batch Size
For both vLLM and TensorRT-LLM, there are various options we can control to minimize TPOT. Among them, we chose to control the batch size in this section.
Batch size plays a key role in balancing TPOT and throughput. Inference with a larger batch size is a more compute-intensive workload and therefore yields higher throughput, while inference with a smaller batch size completes each iteration faster, resulting in lower TPOT.
We conducted experiments by varying the maximum batch sizes for both vLLM and TensorRT-LLM, keeping all other framework settings the same as in the Default Configuration. The goal of this experiment is to identify the optimal batch size that meets the TPOT constraint while maximizing the throughput.
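The selection logic itself is simple, as the sketch below shows. It assumes we already have per-batch-size measurements; the numbers in the dictionary are hypothetical placeholders, not the figures from our runs, and the flag names in the comments are the knobs we varied.

```python
# Sketch: pick the maximum batch size that satisfies the TPOT constraint while
# maximizing throughput. The measurements below are hypothetical placeholders;
# the real values come from sweeping the maximum batch size setting of each
# framework (e.g. max_num_seqs in vLLM, max_batch_size in TensorRT-LLM) and
# re-running the benchmark at each point.
TPOT_LIMIT_MS = 20.0

# batch_size -> (mean TPOT in ms, throughput in tokens/s) -- placeholder values
measurements = {
    1: (11.0, 85.0),
    2: (14.0, 145.0),
    4: (18.0, 228.0),
    8: (26.0, 380.0),
    16: (38.0, 520.0),
}

# Keep only batch sizes that meet the constraint, then take the fastest one.
feasible = {bs: tps for bs, (tpot, tps) in measurements.items() if tpot <= TPOT_LIMIT_MS}
best_bs = max(feasible, key=feasible.get)
print(f"best max batch size: {best_bs}, throughput: {feasible[best_bs]} tokens/s")
```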
Results
Figure 3 shows that TensorRT-LLM consistently maintained a slightly lower (but marginal) TPOT compared to vLLM across all batch sizes. On the other hand, as the maximum batch size increases, the throughput of the two frameworks saturates at significantly different points, with TensorRT-LLM reaching a higher ceiling than vLLM.
In contrast, we observed a different trend in throughput when applying the strict 20ms TPOT constraint, which permits only relatively small batch sizes. As highlighted in Figure 4, a maximum batch size of 4 was the best option for both vLLM and TensorRT-LLM under this constraint. In this case, vLLM achieved 230 Tokens/s, outperforming TensorRT-LLM’s 197 Tokens/s, making vLLM the better option in this scenario.
Scenario #2: TTFT Constrained Scenario
This time, let’s assume we have a hard TTFT constraint. In real-time interaction tasks, such as chatbots or virtual assistants, users expect immediate feedback. A low TTFT ensures quick responses leading to natural conversation flow, while a high TTFT makes the system feel slow and unresponsive. For this experiment, we assumed a TTFT limit of less than 1 second, aiming for near-instant responses.
- Workload: random dataset containing 512 samples with fixed input and output length (2048, 128)
- Requirements: TTFT must be smaller than 1 second
- Goal: Maximize throughput
TTFT and Request Rate
Figure 5 illustrates the close relationship between TTFT and request rate. When the request rate is low, each request completes before the next one arrives, so no queuing occurs and TTFT is almost identical to the prefill-phase latency. However, as the request rate increases, the processing time exceeds the interval between incoming requests, causing queuing delays that grow with the number of requests.
Meanwhile, the default request rate in both vLLM and TensorRT-LLM is infinite, which means all requests arrive as soon as the benchmark starts (see Figure 5c). In that case, TTFT for the later requests becomes extremely high, which explains the results in Figure 2.
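This queuing effect is easy to reproduce with a toy simulation, sketched below. It assumes a fixed per-request prefill latency and a single serving slot, which deliberately ignores batching and the schedulers of both frameworks; it is only meant to show how TTFT inflates once the arrival rate exceeds the service capacity.

```python
# Toy simulation of TTFT vs. request rate: requests arrive at a fixed interval
# and are prefilled one at a time (decode time ignored for simplicity).
def simulate_ttft(request_rate: float, prefill_latency: float, num_requests: int) -> float:
    interval = 1.0 / request_rate
    server_free_at = 0.0
    ttfts = []
    for i in range(num_requests):
        arrival = i * interval
        start = max(arrival, server_free_at)   # wait if the server is still busy
        first_token = start + prefill_latency  # TTFT ends after the prefill phase
        server_free_at = first_token
        ttfts.append(first_token - arrival)
    return sum(ttfts) / len(ttfts)


# Below capacity, TTFT stays near the prefill latency; above it, queuing delay grows.
for rate in (1, 2, 5, 10):
    print(rate, round(simulate_ttft(rate, prefill_latency=0.25, num_requests=100), 3))
```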
In this section, we tried different request rates to find out the maximum request rate each framework could handle while satisfying the TTFT constraint. Apart from the request rate, all other framework settings remained the same as in the Default Configuration.
Results
As shown in Figure 6, TensorRT-LLM consistently outperformed vLLM in TTFT across varying request rates. Under the 1-second TTFT constraint, TensorRT-LLM can handle up to 6 requests per second, while vLLM can handle at most 5 requests per second. TensorRT-LLM achieves 743.44 Tokens/s at 6 requests per second, while vLLM achieves 638.94 Tokens/s at 5 requests per second. Thus, TensorRT-LLM delivers 16.4% higher throughput under the same 1-second TTFT constraint.
This difference in request handling can significantly impact serving costs in scenarios with low TTFT requirements and high request rates, as vLLM would need additional GPU resources to manage higher loads, whereas TensorRT-LLM achieves this with fewer resources.
Final Thoughts
Our evaluation highlights that the choice between vLLM and TensorRT-LLM depends largely on specific application requirements and operational constraints.
It’s important to note that the experiments in this article have a few limitations. First, the results come from limited conditions (e.g., default configurations or adjustments to a single parameter). Both frameworks offer a number of useful features, such as chunked prefill and prefix caching, which can improve all three metrics. Second, the dataset is very simple: all samples have the same input length and generate the same number of output tokens, so inflight batching (also known as continuous batching, a key feature of both vLLM and TensorRT-LLM) is not fully exploited. Third, parallelism such as tensor parallelism (TP) and pipeline parallelism (PP) was not considered, since all benchmarks were run on a single A100 card.
For efficient benchmarking, we used FitsOnChips, a toolkit designed for LLM benchmarking that supports precise configuration adjustments for different frameworks. FitsOnChips allows for fine-tuning each configuration and visualizing its impact on performance, enabling more efficient and informed benchmarking processes. If you are interested in this toolkit, find out more information here.
In the upcoming articles of this series, we will dive deeper into advanced optimizations, explore custom configurations, and assess additional use cases to provide a more exhaustive evaluation of these frameworks across diverse environments. We hope this comparison serves as a foundation for practitioners to make informed decisions about their LLM deployment strategies.
Stay tuned for more insights in the [vLLM vs TensorRT-LLM] series!