[vLLM vs TensorRT-LLM] #2. Towards Optimal Batching for LLM Serving
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on batching configurations and thoroughly examining the effects of maximum batch size and maximum number of tokens.
Oct 11, 2024
Introduction
In our previous article, we compared vLLM and TensorRT-LLM under default configurations and specific constraints, providing insights into their baseline performance. However, relying on default settings or adjusting just a single parameter is not enough to fully exploit the capabilities of these frameworks, especially in complex real-world environments.
In this article of our series, we go deeper by tuning key parameters such as maximum batch size and maximum number of tokens. We adjust these parameters step by step to investigate how they impact the performance of each framework. This will help us identify the optimal batching configurations for both vLLM and TensorRT-LLM, showcasing their strengths and weaknesses over a wider range of scenarios.
Two-phased Text Generation
Before diving into the key parameters, let's break down the two phases of text generation: the prefill phase and the decode phase. In the prefill phase, the model processes all input tokens to build context and produces the first output token, which is then used to generate subsequent output tokens. This is followed by the decode phase, in which the model generates output auto-regressively, using the context built in the prefill phase along with previously generated tokens.
All input tokens are fed into the model simultaneously in the prefill phase. This makes the prefill phase computationally intensive. On the other hand, in the decode phase, only the most recently generated token is fed into the model. The previous context is loaded from the KV cache to reduce redundant computations. Loading KV caches induces significant memory transfer costs, making the decode phase memory-bound. Because the two phases have distinct characteristics, key parameters affect each phase differently.
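To make the two phases concrete, below is a toy, framework-agnostic sketch of greedy generation with a KV cache. It is not vLLM or TensorRT-LLM code; it only shows where the cache is written (prefill, all prompt tokens at once) and where it is read (decode, one token at a time).

```python
# Toy illustration of prefill vs. decode with a stand-in "model".
import torch

torch.manual_seed(0)
d_model, vocab_size = 64, 100
proj = torch.randn(d_model, vocab_size)   # stand-in output projection

def toy_forward(token_ids, kv_cache):
    """Pretend transformer step: append this step's 'K/V' to the cache and
    return next-token logits computed over the full cached context."""
    x = torch.randn(len(token_ids), d_model)   # stand-in hidden states
    kv_cache.append(x)                         # prefill writes many rows, decode writes one
    context = torch.cat(kv_cache, dim=0)       # the whole cache is reread each step
    return context.mean(dim=0) @ proj

prompt = [11, 42, 7, 99]
kv_cache = []

# Prefill: the whole prompt is processed in one compute-heavy pass,
# producing the KV cache and the first output token.
next_token = int(toy_forward(prompt, kv_cache).argmax())

# Decode: one new token per step; prior context comes from the KV cache,
# so each step is dominated by memory traffic rather than compute.
outputs = [next_token]
for _ in range(8):
    outputs.append(int(toy_forward([outputs[-1]], kv_cache).argmax()))
```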
Key Parameters for Batching Configuration
Maximum Batch Size
The maximum batch size, called max_num_seqs in vLLM and max_batch_size in TensorRT-LLM, defines the maximum number of requests that can be processed simultaneously. A larger batch size allows more tokens to be generated in parallel, increasing throughput. However, increasing the batch size can degrade TPOT and requires more memory for KV caches and activations.
Maximum Number of Tokens
The maximum number of tokens, referred to as max_num_batched_tokens in vLLM and max_num_tokens in TensorRT-LLM, limits the number of tokens processed per iteration. Increasing this value typically improves throughput by accommodating longer sequences and larger batches, since larger activation tensors make better use of the hardware's computational resources. However, a higher value can also lead to longer TTFT, so it is important to find the optimal value.
Why Do We Need Both?
When multiple requests arrive, the scheduler batches them based on these two parameters. (The details of how the scheduler works will be covered in a following post of this series.) In the prefill phase, the batch size is usually limited by the max number of tokens, since the number of input tokens per request is large. In the decode phase, by contrast, the max number of tokens rarely plays a deterministic role, because each request contributes only a single input token and the limit is hard to reach. Instead, the max batch size plays the critical role of limiting the decode batch size and balancing TPOT against throughput. Theoretically, the batch size with the lowest TPOT within the throughput saturation range is the optimal value for max batch size. Therefore, both parameters are essential for optimizing performance across the whole generation process. Throughout this article, we use the terms max batch size and max number of tokens for consistency.
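To make the mapping concrete, here is a minimal sketch of how these two knobs are exposed in each framework. The model name and values below are purely illustrative, not recommendations derived from our benchmarks; TensorRT-LLM fixes the equivalent limits at engine build time, shown as a comment.

```python
# Illustrative only: where max batch size and max number of tokens are set.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # example model
    max_num_seqs=256,                    # max batch size
    max_num_batched_tokens=8192,         # max number of tokens per iteration
)

# TensorRT-LLM sets the equivalent limits when the engine is built, e.g.:
#   trtllm-build --checkpoint_dir <converted_checkpoint> \
#                --max_batch_size 256 --max_num_tokens 8192 ...
```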
Experiment Setup
In this post, we focus on two key parameters, max batch size and max number of tokens.
Max Batch Size
- variation: 4, 8, 16, 32, 64, 128, 256, and 512
Max Number of Tokens
- variation: 1024, 2048, 4096, 8192, and 16384
By adjusting these parameters, we assessed their impact on throughput, TTFT, and TPOT for both frameworks across different scenarios.
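For reference, a sweep like the one above can be scripted. The sketch below assumes a hypothetical run_benchmark helper that launches one serving benchmark with the given limits and returns the measured metrics; it is a placeholder, not part of either framework.

```python
import itertools

MAX_BATCH_SIZES = [4, 8, 16, 32, 64, 128, 256, 512]
MAX_NUM_TOKENS_VALUES = [1024, 2048, 4096, 8192, 16384]

def run_benchmark(framework, max_batch_size, max_num_tokens, dataset, request_rate):
    """Placeholder: launch the engine with the given limits, replay the
    fixed-length dataset at the chosen request rate, and return metrics."""
    raise NotImplementedError("wire this up to your own benchmark harness")

results = []
for max_bs, max_tokens in itertools.product(MAX_BATCH_SIZES, MAX_NUM_TOKENS_VALUES):
    for framework in ("vllm", "trtllm"):
        metrics = run_benchmark(
            framework=framework,
            max_batch_size=max_bs,
            max_num_tokens=max_tokens,
            dataset="prefill_heavy",      # or "decode_heavy"
            request_rate=float("inf"),    # or a finite rate such as 4 req/s
        )
        results.append((framework, max_bs, max_tokens, metrics))
```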
Benchmark Dataset
To ensure a fair comparison between vLLM and TensorRT-LLM, we used datasets with fixed input and output lengths to maintain consistency in the number of processed tokens. Also, given the distinct characteristics of the prefill and decode phases, we designed two datasets tailored for each phase:
- Prefill-Heavy Dataset: Contains 1,024 samples, each with an input length of 768 tokens and an output length of 128 tokens. This dataset focuses on the prefill phase, emphasizing longer input length than output length.
- Decode-Heavy Dataset: Contains 1,024 samples, each with an input length of 128 tokens and an output length of 768 tokens. It targets the decode phase, where the model generates a larger number of output tokens.
The sequence lengths were selected to leave room for varying max number of tokens, starting from 1024, since the max number of tokens must exceed the input length of every sample for all samples to be accommodated without exceeding the token capacity.
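As an aside, fixed-length samples like these can be built by sampling random token IDs and decoding them back to text; the sketch below is our illustration of the idea (tokenizer choice and helper names are ours), not the exact script used for these benchmarks. Forcing a fixed output length additionally requires ignoring the EOS token in the serving engine.

```python
# Sketch: build requests with (approximately) fixed input lengths so both
# frameworks process a comparable number of tokens per request.
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # example

def make_fixed_length_prompt(num_input_tokens: int) -> str:
    # Random token IDs decoded to text; re-tokenization may shift the length
    # slightly, so exact-length setups typically trim or pad after re-encoding.
    token_ids = random.choices(range(tokenizer.vocab_size), k=num_input_tokens)
    return tokenizer.decode(token_ids)

# Prefill-heavy: 768 input tokens, 128 output tokens per request.
prefill_heavy = [
    {"prompt": make_fixed_length_prompt(768), "max_tokens": 128, "ignore_eos": True}
    for _ in range(1024)
]

# Decode-heavy: 128 input tokens, 768 output tokens per request.
decode_heavy = [
    {"prompt": make_fixed_length_prompt(128), "max_tokens": 768, "ignore_eos": True}
    for _ in range(1024)
]
```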
Framework Version
We selected recent versions of both frameworks that successfully completed the benchmarking process:
- vLLM: v0.6.2
- TensorRT-LLM: 0.14.0.dev2024092401 with C++ API
Model and Hardware
- Model: Llama-3-8B (BF16)
- H/W: NVIDIA A100-SXM 80G GPU, 32 vCPU 125 GB RAM
Results
To thoroughly evaluate the effect of the key parameters, we varied max batch size and max number of tokens on both datasets under different request rate conditions. Each experiment was assessed with three critical LLM serving metrics: throughput, Time-to-First-Token (TTFT), and Time-Per-Output-Token (TPOT); the definitions of each metric are explained in our previous post. From the results across various request rates, we highlight specific request rates for each metric to better expose the trends of the key parameters: an infinite request rate for throughput and TPOT, and a request rate of 4 for TTFT, since TTFT at an infinite request rate goes far beyond the range of real-world scenarios.
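For clarity, the per-request metrics can be computed from a few timestamps; the snippet below restates the standard definitions (variable names are ours) rather than either framework's built-in reporting.

```python
def request_metrics(arrival_t, first_token_t, finish_t, num_output_tokens):
    """Per-request latency metrics from wall-clock timestamps (seconds)."""
    ttft = first_token_t - arrival_t  # Time-to-First-Token
    # Time-Per-Output-Token: average gap between output tokens after the first.
    tpot = (finish_t - first_token_t) / max(num_output_tokens - 1, 1)
    return ttft, tpot

def throughput(total_generated_tokens, benchmark_wall_time):
    """Aggregate token throughput over the whole run (tokens per second)."""
    return total_generated_tokens / benchmark_wall_time
```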
Scenario #1. Prefill-heavy
We first evaluate the effects of both max batch size and max number of tokens with a prefill-heavy scenario. In this section, we focus on their impact specifically on the prefill phase.
Throughput Results
First of all, throughput increases as max batch size gets larger in both frameworks, as shown in Figure 2. This tendency is expected, because more tokens can be generated simultaneously with a larger batch size. However, beyond a certain threshold, throughput saturates due to computational limits and may even decrease slightly. This indicates that simply increasing max batch size is not always the right solution.
While output tokens are generated auto-regressively, the KV cache can sometimes run out of free slots. In such cases, the LLM engine can preempt lower-priority requests and resume their generation later. There are two strategies to handle preemption: recomputation and swapping. With recomputation, the KV caches of the preempted requests are dropped and recomputed later when generation resumes. With swapping, the KV caches are moved out to host memory and swapped back in when space becomes available. In both cases, preemption can slow down overall performance, so it is important to monitor its occurrence and adjust the parameters to avoid it if necessary.
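As a concrete example of the knobs involved, here is a hedged sketch using vLLM's engine arguments (assuming the v0.6.x interface, where preemption_mode selects recomputation versus swapping and swap_space reserves host memory for swapped-out KV blocks).

```python
# Hedged sketch (assuming vLLM v0.6.x engine arguments): control how the
# scheduler behaves when KV-cache blocks run out.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    max_num_seqs=256,
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + KV cache
    swap_space=4,                 # GiB of host memory for swapped-out KV blocks
    preemption_mode="recompute",  # or "swap"
)
# vLLM logs a warning whenever a sequence group is preempted; if that warning
# appears frequently, lowering max_num_seqs (or shortening sequences) helps.
```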
Throughput also improves when max number of tokens is increased. This improvement comes from the prefill phase, as the prefill batch size becomes larger with a higher max number of tokens value. From both results, it can be observed that the framework showing better throughput varies depending on the configuration. In general, TensorRT-LLM achieves higher throughput with larger batch sizes, while vLLM is faster with smaller batch sizes.
TTFT Results
TTFT is highly related to the request rate, as shown in our previous post. In this article, we focus more on the effect of batching by tuning max batch size and max number of tokens. As mentioned above, a request rate of 4 is used for this experiment.
TTFT reaches saturation after a few small max batch size values. The huge TTFT at small max batch size is due to the queueing time of subsequent requests caused by limited throughput. An explanation of the queueing time is also covered in our previous post. As throughput is very low with small max batch size, requests that are fed in while previous requests are in the decode phase suffer a long delay until their prefill phase begins.
As shown in Figure 5, max number of tokens does not seem to have a significant influence on TTFT. This may seem unexpected, considering that max number of tokens determines the prefill batch size, but it is also explained by queueing time. As the prefill batch size increases with a larger max number of tokens, the latency of each prefill iteration becomes longer; however, prefill throughput improves, reducing the queueing time of subsequent requests. Comparing the two frameworks, TensorRT-LLM consistently demonstrated faster TTFT across the different max number of tokens values.
TPOT Results
As shown in Figure 6, max batch size has a strong influence on TPOT: TPOT increases as max batch size grows. Although TPOT gets worse, throughput increases with larger batch sizes, as seen in the throughput results. vLLM exhibits better TPOT up to the point where TensorRT-LLM's TPOT saturates, and TensorRT-LLM's saturation points are similar to those observed in the throughput results.
To further explain the saturation of TPOT, we evaluated the average running batch size from the TensorRT-LLM benchmarks. Running batch size refers to the actual batch size of the input tensors after preemption, which directly affects TPOT. The running batch size values at the saturated points are almost identical.
In contrast, the TPOT of vLLM did not reach its saturation point because its average running batch size keeps increasing. Although the batch size continues to grow, throughput saturates at a max batch size of 256, as the workload is already compute-bound. The different running batch size trends between vLLM and TensorRT-LLM arise from their different scheduling methodologies, which will also be covered later in our vLLM vs TensorRT-LLM series.
In both frameworks, TPOT decreases slightly as max number of tokens grows. When max batch size is 256 and max number of tokens is 1024, TensorRT-LLM shows an atypically low TPOT due to its smaller average running batch size compared to the other configurations (see Figure 7).
The decreasing trend in TPOT may seem unexpected, as max number of tokens is a dominant factor for the prefill phase. The phenomenon comes from delays between the first and subsequent output tokens in some requests. In most cases, the decode batch size is larger than the prefill batch size, so multiple prefill iterations are needed to fill the next decode batch. If the result of the first prefill batch is sent back immediately, those requests suffer a delay before generating their next token while the remaining prefill iterations run. This leads to high TPOT for some requests, making the maximum TPOT much larger than the minimum TPOT, as shown in Table 1. As max number of tokens increases, prefill throughput improves, which reduces the effect of these delays and ultimately lowers TPOT.
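A toy calculation, with made-up numbers purely for illustration, shows how these delays inflate the worst-case TPOT:

```python
# Hypothetical numbers, only to illustrate the delay mechanism described above.
decode_batch_size = 256    # requests the decode phase can run together
prefill_batch_size = 64    # requests one prefill iteration can admit
prefill_iter_time = 0.25   # seconds per prefill iteration (made up)

# Requests admitted in the first prefill iteration have already emitted their
# first token, but wait for the remaining prefill iterations before decoding
# their second token together with the full batch.
remaining_iters = decode_batch_size // prefill_batch_size - 1   # = 3
extra_gap = remaining_iters * prefill_iter_time                 # = 0.75 s
print(f"extra gap before the second output token: {extra_gap:.2f} s")
```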
Scenario #2. Decode-heavy
In the decode-heavy scenario, the overall tendencies of each parameter are mostly similar to the prefill-heavy results, so in this section, we will focus more on the cases where results from the two scenarios differ.
In the prefill-heavy scenario, we have observed that larger max number of tokens results in higher throughput. However, in the decode-heavy scenario, the influence of max number of tokens is hidden by long decoding iterations.
In contrast, max batch size is more influential in the decode-heavy scenario, as shown in Figure 10. While the throughput of the prefill-heavy benchmark increased by about 10x when max batch size was raised from 4 to 512, the decode-heavy benchmark's throughput increased by about 30~40x over the same range.
In the prefill-heavy scenario, a max batch size of 16 was sufficient to reach TTFT saturation. This saturation point marked the threshold where queuing time was eliminated due to the high throughput. In contrast, in the decode-heavy scenario, a max batch size of 128 was required (Figure 11), as the longer output length demanded higher throughput to offset the queuing time delay.
Final Thoughts
In this post, we have reviewed the impact of two key parameters, max batch size and max number of tokens, which are closely related to how requests are batched in vLLM and TensorRT-LLM. The results show that adjusting these parameters significantly affects the performance of both frameworks.
Every service has its own priorities. If a service prioritizes throughput, increasing both max batch size and max number of tokens until saturation is generally beneficial. However, it is essential to find values that prevent preemption, taking into account factors like memory capacity and other configurations such as sequence length. If a service prioritizes TTFT, max batch size should be set large enough to provide sufficient throughput and eliminate queueing delays. When TPOT is the priority, both parameters must be tuned to strike a balance between TPOT and throughput.
While preparing this article, we conducted a vast number of experiments, as seen in Figure 12, to fully understand how each parameter and environmental variable interacts. However, there is more to explore: any change in the service scenario or model will require additional experimentation. This is why we developed Fits on Chips to simplify this process. The toolkit enables efficient analysis and performance tuning, ensuring that continuous experimentation remains manageable and effective.
In future articles, we will explore advanced optimizations, custom configurations, and additional use cases to provide a more comprehensive evaluation of these frameworks across diverse environments. We hope this comparison assists practitioners in making informed decisions about their LLM deployment strategies.
Stay tuned for more insights in the vLLM vs TensorRT-LLM series!