[vLLM vs TensorRT-LLM] #5. Dynamic Sequence Lengths
This article provides a comparative analysis of vLLM and TensorRT-LLM frameworks, focusing on performance with fixed and dynamic datasets.
Oct 30, 2024
Introduction
A growing number of services powered by LLM serving systems are continuously emerging. In real-world applications, requests vary significantly in length, and each operates under different constraints. In our previous post, we explored how the schedulers in vLLM and TensorRT-LLM can influence serving performance. We intentionally used controlled workloads with fixed-length datasets to isolate the effect of specific serving parameters and observe how each scheduler behaves under predictable conditions. By fixing both input and output lengths and ignoring the end-of-sequence (EOS) token, generation proceeded predictably to each request's maximum output length, providing a clear baseline for evaluating fundamental performance factors.
Now, shifting to dynamic-length datasets, we explore how the schedulers perform on more complex workloads, where input lengths vary and generation often stops before reaching the maximum output length. This dynamicity affects performance metrics such as throughput and TPOT (time per output token), as resource utilization fluctuates with the incoming requests. In this post, we aim to reveal how dynamic sequence lengths influence each scheduler's ability to manage fluctuating demands, and what that implies for optimizing LLM serving.
Dynamic-Sonnet Dataset
Numerous datasets for testing LLMs are available, but we wanted to ensure we used the right dataset to accurately assess the performance impact of serving frameworks. Here, we introduce a new curated dataset called Dynamic-Sonnet, specifically designed to better evaluate the effects of dynamicity.
Fixed-length datasets in typical benchmarks enable controlled experiments and simplify analysis, but they fail to capture the effects of variable sequence lengths. Conversely, dynamic-length datasets demonstrate significant variability, though they can make result analysis challenging due to high diversity. Our goal was to combine the advantages of both by creating a dynamic dataset with a controlled distribution. To better simulate the dynamic behavior in both inputs and outputs, we created a new dataset inspired by the benchmarking approach of ray-project/llmval.
Dynamic-Sonnet contains four subsets (1K, 2K, 4K, and 8K), each designed with varied token lengths to closely reflect real-world usage. Each prompt asks the model to select as many lines as possible from a series of Shakespearean sonnets. The number of sonnets included in each prompt is randomly chosen so that the total length follows a normal distribution with a target mean for each subset.
Example data:
"Pick as many lines as you can from these poem lines:\n FROM fairest creatures we desire increase,\n That thereby beauty's rose might never die,\n But as the riper should by time decease,\n His tender heir might bear his memory:\n …"
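As a rough illustration, the sketch below shows one way such prompts could be assembled. This is our own simplified reconstruction rather than the actual generation script: the prefix text, target statistics, and whitespace-based token counting are placeholders (a real script would use the model tokenizer).

```python
import random

# Hypothetical reconstruction of a Dynamic-Sonnet-style prompt builder: keep appending
# sonnet lines until a per-request target token count, drawn from a normal distribution,
# is reached. Whitespace token counting stands in for a real tokenizer to keep the
# sketch self-contained.
PREFIX = "Pick as many lines as you can from these poem lines:\n"

def build_prompt(sonnet_lines, target_mean=512, target_std=64, rng=random):
    target_tokens = max(1, int(rng.gauss(target_mean, target_std)))
    chosen, token_count = [], len(PREFIX.split())
    for line in sonnet_lines:
        line_tokens = len(line.split())          # placeholder for tokenizer(line)
        if token_count + line_tokens > target_tokens:
            break
        chosen.append(line)
        token_count += line_tokens
    return PREFIX + "\n".join(chosen)

# Example usage with a few lines from Sonnet 1 and a small target length.
sonnet_lines = [
    "FROM fairest creatures we desire increase,",
    "That thereby beauty's rose might never die,",
    "But as the riper should by time decease,",
    "His tender heir might bear his memory:",
]
print(build_prompt(sonnet_lines, target_mean=30, target_std=5))
```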
Figure 1 and Table 1 illustrate the token length distribution for each subset of the Dynamic-Sonnet dataset. Each NK subset is designed so that the maximum prompt length does not exceed NK tokens. For instance, the average and maximum prompt lengths of the 1K subset are around 512 and 773 tokens, respectively, while those of the 8K subset are 7,153 and 7,709 tokens. This design enables in-depth testing of LLM serving systems, challenging each framework's (or scheduler's) capacity to handle dynamic workloads with unpredictable token lengths.
Experiment Setup
We kept most of the settings the same as in our previous posts. One change, however, was that we used Triton Server for TensorRT-LLM instead of the previously used C++ API. We configured the server-client setup to align with the OpenAI API interface for both vLLM and TensorRT-LLM.
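With both frameworks exposed behind an OpenAI-compatible interface, the benchmark client can stay identical across them. Below is a minimal sketch of such a client call; the endpoint URL, API key, and request parameters are placeholders for whatever the actual benchmark harness used.

```python
from openai import OpenAI

# Placeholder endpoint: in our assumed setup, vLLM's OpenAI-compatible server or an
# OpenAI-compatible frontend in front of Triton/TensorRT-LLM listens at this address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{
        "role": "user",
        "content": "Pick as many lines as you can from these poem lines: ...",
    }],
    max_tokens=1024,   # per-request maximum output length (placeholder value)
    temperature=0.0,
)
print(response.choices[0].message.content)
```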
Framework Version, Model and Hardware
- vLLM: 0.6.3.post1
- TensorRT-LLM: v0.13.0 release / Triton Server: v2.50.0
- Model: Llama-3.1-8B-Instruct (BF16)
- H/W: NVIDIA A100-PCIe 80G GPU, AMD EPYC 7643 48-Core Processor, 128 GB RAM
Dataset
- Dynamic: dynamic_sonnet_llama3 1K, 2K, 4K, 8K
- Fixed: random tokens with fixed input and output lengths corresponding to those of the dynamic dataset 1K, 2K, 4K, 8K
Note that the input lengths of the fixed dataset are not strictly 1K, 2K, 4K, and 8K. Instead, the input and output lengths of the fixed dataset are set to match the average lengths observed in the corresponding dynamic subset. This balances the computational load between the two datasets and enables a fair comparison; the labels simply indicate which dynamic subset each fixed dataset corresponds to.
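A minimal sketch of how such a fixed-length counterpart could be generated is shown below. This is our own illustration: the vocabulary size and lengths are placeholders, and a real script would detokenize the sampled IDs with the model tokenizer if the serving API expects text prompts.

```python
import random

def make_fixed_request(input_len, output_len, vocab_size=128_000, seed=None):
    """Sample input_len random token IDs and pair them with a fixed output budget."""
    rng = random.Random(seed)
    prompt_token_ids = [rng.randrange(vocab_size) for _ in range(input_len)]
    return {
        "prompt_token_ids": prompt_token_ids,  # detokenize for text-based APIs
        "max_tokens": output_len,              # fixed output length
        "ignore_eos": True,                    # force generation to the full budget
    }

# e.g. a "1K" fixed request whose lengths match the dynamic 1K averages
# (the numbers here are placeholders, not the exact measured averages).
request = make_fixed_request(input_len=512, output_len=512, seed=0)
print(len(request["prompt_token_ids"]), request["max_tokens"])
```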
Configurations
- Max batch size: 256
- Max number of tokens: 16384
- Request rate: Inf
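For reference, here is a rough sketch of how the configurations above map onto vLLM's engine arguments. The offline LLM API is used only for brevity (the actual runs used the OpenAI-compatible server), TensorRT-LLM takes corresponding values at engine build and runtime configuration time, and running this requires a GPU plus access to the model weights; the context length shown is an assumption large enough for the 8K subset.

```python
from vllm import LLM

# Sketch: the benchmark's scheduler-related knobs expressed as vLLM engine arguments.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_num_seqs=256,              # max batch size
    max_num_batched_tokens=16384,  # max number of tokens per scheduler iteration
    max_model_len=16384,           # assumed context limit covering the 8K subset
)

# "Request rate: Inf" is a client-side setting: the benchmark client submits all
# requests at once instead of pacing them with inter-arrival delays.
```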
Results
Experiment #1: Effect of Dynamic Sequence Length
To evaluate the impact of dynamic sequence lengths, we benchmarked each corresponding pair of fixed and dynamic datasets (e.g., Dynamic-Sonnet 1K and Fixed 1K) in identical environments. Since the output lengths of the fixed datasets were set to match the average output lengths of the dynamic datasets, the total number of generated tokens was almost the same for each dataset pair.
Fixed datasets provide a stable environment for schedulers, with minimal variability allowing for predictable resource allocation and consistent performance. In contrast, dynamic datasets require the schedulers to continually adjust batch sizes and token counts, which significantly impacts hardware utilization. This experiment highlights the scheduler's crucial role in maintaining consistent performance despite token length variability—a significant challenge for optimizing LLM serving.
Figure 2 illustrates the throughput comparison of Fixed and Dynamic dataset benchmarks in vLLM and TensorRT-LLM. For shorter sequences, such as 1K or 2K, the throughput for the fixed dataset is noticeably higher than for the dynamic dataset. This difference can be attributed to distinct scheduling mechanisms in vLLM and TensorRT-LLM.
In TensorRT-LLM, requests are managed by the default GUARANTEED_NO_EVICT policy, which preallocates memory for the KV cache of each request based on the maximum output length (discussed in detail in our previous post). With dynamic datasets, the KV cache is allocated based on the possible maximum length, often resulting in memory waste as more memory is used than necessary. In contrast, with fixed datasets where output lengths are predictable, memory can be allocated precisely for the required KV cache. For this experiment, the fixed dataset’s output length is set to match the dynamic dataset’s average output length, allowing it to allocate less memory for the KV cache and thus support a larger running batch size than the dynamic dataset. This trend is further demonstrated in Figure 3.
The throughput difference becomes marginal with longer sequences, such as 4K and 8K. This is because, in these scenarios, the length of the prompt starts to dominate over the output length. While GUARANTEED_NO_EVICT allocates KV cache based on the maximum output length, the KV cache size for the prompt remains similar for both fixed and dynamic datasets, rendering the overhead of wasted memory negligible. Consequently, the active batch size for both fixed and dynamic benchmarks converges, as shown in Figure 3, resulting in similar overall throughput.
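To make the memory argument concrete, here is a back-of-envelope sketch. It is our own illustration based on Llama-3.1-8B's architecture (32 layers, 8 KV heads, head dimension 128) with a BF16 KV cache; the KV budget and request lengths are placeholder numbers, and real allocators work in paged blocks rather than exact token counts.

```python
# Bytes of KV cache per token for Llama-3.1-8B in BF16:
# 2 (K and V) * 32 layers * 8 KV heads * 128 head_dim * 2 bytes per element.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2      # = 131072 bytes = 128 KiB
BUDGET_BYTES = 60 * 1024**3                    # placeholder KV cache budget

def max_batch(prompt_len, reserved_output_len):
    """Requests that fit if each reserves prompt_len + reserved_output_len tokens."""
    return int(BUDGET_BYTES // ((prompt_len + reserved_output_len) * KV_BYTES_PER_TOKEN))

MAX_OUT, AVG_OUT = 1024, 256   # placeholder maximum vs. average output lengths

for prompt_len in (512, 7000):  # short (1K-like) vs. long (8K-like) prompts
    print(f"prompt={prompt_len}: "
          f"reserve max output -> batch {max_batch(prompt_len, MAX_OUT)}, "
          f"reserve avg output -> batch {max_batch(prompt_len, AVG_OUT)}")
```

In this toy calculation, reserving for the maximum output roughly halves the feasible batch size when prompts are short, while for 8K-like prompts the reservation is only a small fraction of each request's KV footprint, matching the converging trend in Figure 3.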
For vLLM, the scheduling policy is similar to the MAX_UTILIZATION strategy in TensorRT-LLM (discussed in the next section), so the gap between fixed and dynamic datasets stems from different factors. Unlike TensorRT-LLM, vLLM does not support mixed batching by default, so prefill and decode requests are batched separately. In fixed-length generation, the decode batch size tends to remain maximized, as all requests undergo the same number of iterations. With dynamic datasets, however, requests that generate an EOS token finish earlier, leaving smaller decode batch sizes for the remaining requests.
Similarly, in dynamic datasets, the prefill batch size (or recomputation batch for preempted requests) varies, leading to more iterations where the prefill batch size is smaller compared to fixed datasets. As shown in Figure 3, this reduction in average running batch size becomes more evident when we look at the average batch size across all iterations (including prefill iterations) rather than only decode iterations. Overall, TensorRT-LLM shows greater resilience in dynamic scenarios than vLLM, as it natively supports mixed batching.
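The shrinking decode batch can be illustrated with a toy simulation. This is our own sketch, not vLLM's scheduler: it follows a single wave of 256 requests that start decoding together, ignores backfilling with newly prefilled requests, and draws dynamic output lengths uniformly with the same mean as the fixed case.

```python
import random

def avg_decode_batch(output_lens):
    """Average live batch size per decode step when all requests start together and
    each leaves the batch after producing its own number of output tokens."""
    total_steps = max(output_lens)       # steps until the longest request finishes
    busy_slots = sum(output_lens)        # one occupied batch slot per generated token
    return busy_slots / total_steps

rng = random.Random(0)
n, mean_out = 256, 512

fixed = [mean_out] * n                                       # every request runs 512 steps
dynamic = [rng.randint(1, 2 * mean_out) for _ in range(n)]   # same mean, high variance

print("fixed   avg decode batch:", round(avg_decode_batch(fixed), 1))    # 256.0
print("dynamic avg decode batch:", round(avg_decode_batch(dynamic), 1))  # well below 256
```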
Meanwhile, Figure 4 illustrates the TPOT trend, which has some noteworthy aspects. As discussed in previous posts, TPOT generally correlates with the average batch size, so we would expect TPOT to decrease for longer sequences as the average batch size shrinks. However, we observe that TPOT actually increases as input sequence length grows, because longer input sequences lead to a larger KV cache, increasing memory overhead and attention computation. These two competing effects, a smaller batch size versus a higher per-token cost, produce the trend shown in Figure 4.
Another interesting point is that TPOT for the fixed dataset is lower than that for the dynamic dataset. This can be attributed to the fact that vLLM does not support mixed batching. In the case of a fixed dataset, all requests have the same length, allowing for a clear separation between prefill and decode iterations. However, with a dynamic dataset, some requests end generation earlier than others within the same decode batch. This creates additional budget for prefill scheduling, increasing the likelihood of prefill iterations from other waiting batches interleaving throughout the generation. This additional prefill can slow down the average TPOT for dynamic datasets.
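A simple way to see why interleaved prefill iterations inflate TPOT is the toy model below; the per-iteration latencies and counts are made-up placeholders, not measured values.

```python
def tpot_ms(decode_steps, decode_ms, interleaved_prefills, prefill_ms):
    """Mean time per output token for one request: each generated token costs one
    decode iteration, and the request also waits through any prefill iterations the
    scheduler interleaves into its generation."""
    total_wait_ms = decode_steps * decode_ms + interleaved_prefills * prefill_ms
    return total_wait_ms / decode_steps

# Placeholder numbers: 512 output tokens, 20 ms per decode step, 60 ms per prefill step.
print("fixed   (no interleaved prefills):", tpot_ms(512, 20, 0, 60), "ms")   # 20.0 ms
print("dynamic (32 interleaved prefills):", tpot_ms(512, 20, 32, 60), "ms")  # 23.75 ms
```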
Experiment #2: Effect of Scheduling Policy
In our second experiment, we evaluated the performance of two TensorRT-LLM scheduler policies, GUARANTEED_NO_EVICT and MAX_UTILIZATION, using a dynamic dataset. With GUARANTEED_NO_EVICT, KV cache memory for each request is preallocated, ensuring that scheduled requests are guaranteed NOT to be preempted due to the memory constraints. In contrast, MAX_UTILIZATION policy allocates KV cache memory on-demand during output generation, packing as many requests as possible for each iteration at the risk of some requests being preempted if memory is insufficient. This difference is especially noticeable with dynamic inputs, where actual output lengths are typically shorter than the maximum output length.
To demonstrate this, we used the 4K subset of Dynamic-Sonnet, adjusting the maximum output length from 1K to 4K. Additionally, token ID 13 (the token ID for a period, ".") was set as the EOS token to end generation as early as possible, further widening the gap between the maximum output length and the actual output lengths.
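For reference, this EOS override can be sanity-checked against the model tokenizer with a snippet like the one below (assuming access to the Hugging Face model repository; which ID maps to "." is tokenizer-specific).

```python
from transformers import AutoTokenizer

# Check what token ID 13 decodes to for the model used in this benchmark.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(repr(tokenizer.decode([13])))  # expected to print '.' for this tokenizer
```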
Figure 5 presents the throughput and average running batch size for each policy. When the maximum output length is 1K, GUARANTEED_NO_EVICT shows slightly higher throughput. Here, the average batch sizes of the two policies are similar, since the KV cache preallocated by GUARANTEED_NO_EVICT is sufficient to accommodate requests until generation terminates early under the short maximum output length. Additionally, MAX_UTILIZATION may incur extra latency overhead due to its more complex scheduling features, including non-coalesced KV cache allocation, making its throughput slightly lower than that of the simpler GUARANTEED_NO_EVICT. However, as the maximum output length increases, MAX_UTILIZATION surpasses GUARANTEED_NO_EVICT in throughput, and the performance gap widens. This trend is also visible in the average running batch size, where MAX_UTILIZATION consistently achieves a larger batch size across all cases. By reducing KV cache memory usage, MAX_UTILIZATION frees up space to batch more requests. In general, MAX_UTILIZATION delivers higher throughput in dynamic scenarios, but it should be applied carefully, as excessive preemption could reduce throughput in certain cases.
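A toy admission simulation gives an intuition for why the gap widens as the maximum output length grows. This is our own simplification, not TensorRT-LLM's actual scheduler: paged-block granularity, preemption costs, and the max batch size cap are ignored, and the KV budget and lengths are placeholders (the per-token KV size follows the earlier Llama-3.1-8B sketch).

```python
import random

KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2             # Llama-3.1-8B, BF16 KV cache
BUDGET_TOKENS = 60 * 1024**3 // KV_BYTES_PER_TOKEN     # placeholder budget, in tokens

def admitted(policy, prompt_len, max_out, actual_outs):
    """How many waiting requests can run concurrently under each policy."""
    batch, used = 0, 0
    for actual in actual_outs:
        # GUARANTEED_NO_EVICT reserves the full maximum output length up front;
        # MAX_UTILIZATION only needs room for what is actually generated.
        need = prompt_len + (max_out if policy == "guaranteed_no_evict" else actual)
        if used + need > BUDGET_TOKENS:
            break
        batch, used = batch + 1, used + need
    return batch

rng = random.Random(0)
prompt_len = 3000                                          # roughly 4K-subset prompts
actual_outs = [rng.randint(16, 256) for _ in range(512)]   # early EOS -> short outputs

for max_out in (1024, 2048, 4096):
    g = admitted("guaranteed_no_evict", prompt_len, max_out, actual_outs)
    m = admitted("max_utilization", prompt_len, max_out, actual_outs)
    print(f"max_out={max_out}: guaranteed_no_evict={g}, max_utilization={m}")
```

In this toy model the MAX_UTILIZATION batch is unaffected by the maximum output length, while the GUARANTEED_NO_EVICT batch shrinks as the reservation grows, mirroring the widening gap in Figure 5.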
Final Thoughts
In this post, we examined performance metrics in dynamic scenarios and compared them with the results from fixed scenarios. Overall, we observed a performance decrease in dynamic environments for both vLLM and TensorRT-LLM, with non-deterministic output lengths having a greater impact than input dynamicity. This highlights the importance of measuring performance with dynamic requests to properly reflect real-world serving environments.
In future articles, we will explore various advanced features for real-world LLM deployments, starting with various quantization methodologies. We hope this series helps practitioners fully leverage the potential of LLM serving.
Stay tuned for more insights in the vLLM vs TensorRT-LLM series!