[vLLM vs TensorRT-LLM] #4. Which Scheduler Wins? 🔥
This article provides a comparative analysis of the schedulers in the vLLM and TensorRT-LLM frameworks.
Oct 24, 2024
Contents
- Introduction
- Scheduling Basics
- Static Request-Level Scheduling
- Iteration-Level Scheduling
- Packed Batching
- Continuous Batching (In-flight Batching)
- Memory-aware Scheduling
- Preemption
- How TensorRT-LLM Schedules Requests
- How vLLM Schedules Requests
- Experiment Setup
- Framework Version, Model and Hardware
- Benchmark Dataset
- Results
- Final Thoughts

Introduction
The era of Transformers and Large Language Models (LLMs) is flourishing. Beyond the evolution in model architecture, workloads have become increasingly dynamic, making system-wide optimization just as crucial as model-level optimization. In particular, the way requests are scheduled and batched has emerged as a key factor in determining serving performance.
While vLLM and TensorRT-LLM differ in several ways, one of the most notable distinctions is their schedulers. Optimized request batching and management are key to improving performance and lowering costs, especially under constantly changing compute and memory demands. For this reason, both vLLM and TensorRT-LLM incorporate dedicated scheduler components to manage user requests effectively.
In this post, we will explore how these schedulers function and how they affect performance and resource utilization.
Scheduling Basics
Static Request-Level Scheduling
Let's start with a naive form of scheduling: static request-level scheduling. In this method, incoming requests are grouped into batches as they arrive and processed together. All requests in a batch are processed in parallel, so any new request added to the next batch will be handled after all the requests in the current batch have been completed.
While straightforward, this method leads to inefficiencies, particularly when individual requests have different input and output lengths. Requests with shorter sequences can experience significant delays because all requests in the batch must wait for the longest-running request to complete.
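To make this concrete, here is a minimal sketch of static request-level scheduling. The request objects and their finished / generate_next_token interface are hypothetical, for illustration only.

# Minimal sketch of static request-level scheduling.
# Request objects (with .finished and .generate_next_token()) are hypothetical.
def static_request_level_schedule(request_queue, batch_size):
    while request_queue:
        # Form a batch from the requests that have arrived so far.
        batch = [request_queue.pop(0)
                 for _ in range(min(batch_size, len(request_queue)))]
        # The batch runs until every request is done, so short requests
        # wait for the longest one before any new request is admitted.
        while any(not r.finished for r in batch):
            for r in batch:
                if not r.finished:
                    r.generate_next_token()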
Iteration-Level Scheduling
Iteration-level scheduling was introduced to address the limitations of static request-level scheduling. Rather than scheduling entire requests, this approach breaks work into smaller units called "iterations". In auto-regressive models, an iteration is typically defined as generating a single token. By scheduling at this finer granularity, iteration-level scheduling minimizes wasted resources and improves hardware utilization.
Iteration-level scheduling can significantly improve computational efficiency, as requests often have varying token lengths. Requests that finish early allow new ones to be added to the batch, instead of waiting for the entire batch to complete. This approach reduces idle time for hardware resources and improves overall throughput, especially in workloads with varying token counts across requests.
Figure 2 illustrates how the scheduler handles multiple requests with varying output lengths. In this example, six requests (A to F) are scheduled with a batch size of 2. In the beginning, requests A and B are batched together and processed simultaneously. Request B then finishes much earlier than request A. With iteration-level scheduling, request C begins as soon as request B finishes, without waiting for request A to complete. We can observe that there is no idle waiting between requests, and as a result, the overall computation time is reduced.
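The same loop at iteration granularity, using the same hypothetical request interface as above, can be sketched as follows. Free batch slots are refilled every iteration instead of once per batch.

# Sketch of iteration-level scheduling: scheduling decisions are made
# every iteration (one generated token per request), not once per batch.
def iteration_level_schedule(request_queue, batch_size):
    running = []
    while running or request_queue:
        # Refill free slots immediately instead of waiting for the whole batch.
        while len(running) < batch_size and request_queue:
            running.append(request_queue.pop(0))
        for r in running:
            r.generate_next_token()  # one iteration per running request
        # Retire finished requests so their slots free up next iteration.
        running = [r for r in running if not r.finished]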
Packed Batching
Packed batching is another crucial component for efficient execution of scheduled requests, even though it is not a scheduling technique on its own. As shown in Figure 2, iteration-level scheduling often requires both prefill and decode stages to be processed within the same iteration. The two phases differ significantly in input token counts: the prefill stage processes multiple tokens at once, while the decode stage handles only one token at a time, namely the output token from the previous iteration (as discussed in our previous post). This discrepancy in token length results in substantial padding when a request in the prefill stage and a request in the decode stage are grouped in the same batch. Padding also occurs even when all requests in a batch are in the prefill stage, as they often have different input token lengths. This padding degrades computational efficiency, as the extra computations for the padding are essentially wasted.
Packed batching addresses this issue by concatenating inputs along the sequence dimension instead of the batch dimension, eliminating unnecessary padding and improving hardware utilization. Packed batching is straightforward in most LLM layers, as they are sequence-length agnostic. However, one of the core components of LLM models, the attention layer, poses a challenge. Each request requires a different attention pattern, so attention must be computed individually for each request. As a result, packed batching must compute attention for each request by slicing the concatenated inputs accordingly.
Here’s a simplified version of how packed attention works:
# naive pseudo code of packed attention
# see https://github.com/vllm-project/vllm/pull/3236 for detail
def packed_attn(Q, K, V, lens):
    # Q, K, V are packed along the sequence dimension: [total_tokens, ...]
    out = empty_like(Q)
    s = 0
    for length in lens:
        e = s + length
        q = Q[s:e]
        k = K[s:e]
        v = V[s:e]
        out[s:e] = attn(q, k, v)  # standard attention over a single request
        s = e
    return out
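For example, if three requests with 5, 3, and 8 tokens are packed together, Q, K, and V each hold 16 rows in total, and calling the function with lens = [5, 3, 8] computes attention independently over rows 0-4, 5-7, and 8-15 before writing the results back into the packed output.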
As shown in Figure 3, the packed requests are sliced for the attention layers and concatenated again along the sequence dimension after the computation. Slicing requests for the attention layers does introduce some overhead, but the benefit of eliminating unnecessary padding often outweighs it. As a result, this approach can significantly enhance serving performance by improving hardware utilization.
Continuous Batching (In-flight Batching)
By integrating iteration-level scheduling and packed batching, we arrive at the core of the vLLM and TensorRT-LLM schedulers: continuous batching (also known as in-flight batching). This approach aims to minimize queue wait times and reduce padding overhead, resulting in better hardware utilization and serving performance.
Both vLLM and TensorRT-LLM fundamentally share this same scheduling policy. However, they differ in the specifics, particularly regarding memory management. These differences are key factors contributing to the performance variations observed between the two frameworks. One major factor is the management of the KV cache, which plays a critical role in determining how efficiently requests can be scheduled. In the following section, we'll take a closer look at how KV cache management affects scheduling and overall performance.
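Putting iteration-level scheduling and packed batching together, one step of a continuous-batching scheduler can be sketched roughly as below. This is a simplified illustration, not the actual vLLM or TensorRT-LLM implementation; the helpers (batch_token_count, pack_along_sequence, model_forward, append_new_tokens) are made up for the example.

# Simplified sketch of one continuous-batching iteration.
def continuous_batching_step(waiting, running, max_batch_size, max_num_tokens):
    # 1. Admit new requests at iteration granularity, within the
    #    max batch size and max number of tokens budgets.
    while waiting and len(running) < max_batch_size:
        candidate = waiting[0]
        if batch_token_count(running) + len(candidate.prompt_tokens) > max_num_tokens:
            break
        running.append(waiting.pop(0))
    # 2. Pack prefill and decode inputs along the sequence dimension
    #    (packed batching) and run one iteration for every request.
    packed_inputs, lens = pack_along_sequence(running)
    outputs = model_forward(packed_inputs, lens)
    append_new_tokens(running, outputs)
    # 3. Retire finished requests so their slots free up next iteration.
    running[:] = [r for r in running if not r.finished]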
Memory-aware Scheduling
In the previous post, we discussed two key parameters that determine how requests are batched: max batch size and max number of tokens. Depending on the workload, the batch is limited by max batch size in some cases and constrained by max number of tokens in others.
There’s another constraint that we haven’t addressed yet: the size of the KV cache. A request cannot be scheduled if there is not enough remaining KV cache to store its context. This poses a significant challenge because, in most LLM serving scenarios, the memory required for the KV cache often exceeds the memory needed to load the model itself. While this memory limitation is restrictive, recomputing the whole KV cache for each output token must be avoided, as it would introduce prohibitive computational overhead. As a result, even when neither the batch size nor the token count is the limiting factor, the amount of remaining KV cache can limit the number of requests that can be batched together.
However, unlike the other limiting factors, the size of the KV cache is not deterministic: it grows with each generated token and can expand up to the maximum output token length. As a result, managing the remaining KV cache involves a certain level of estimation. As with other estimation challenges, we can approach KV cache allocation either pessimistically or optimistically. These two strategies are known as preallocation and on-demand allocation, respectively.
In preallocation, once a request is scheduled, memory for its KV cache is reserved based on the sum of its input tokens and the maximum number of tokens it may generate. This ensures that there will be no memory shortage during the decoding stage, as the memory required for the maximum KV cache size is already allocated. TensorRT-LLM uses preallocation as its default policy, known as GUARANTEED_NO_EVICT.
Preallocating memory for the maximum KV cache size can lead to inefficient memory usage, as not all requests generate up to their maximum token limit. In contrast, on-demand allocation allocates KV cache memory dynamically as tokens are generated, rather than reserving the maximum upfront. This is the only policy available in vLLM and the behavior of TensorRT-LLM's alternative policy, MAX_UTILIZATION. This strategy minimizes memory waste and allows for larger batch sizes, but it introduces the risk of running out of KV cache, which is handled through preemption.
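The difference between the two strategies comes down to how much KV cache the scheduler reserves when admitting a request. Below is a hedged sketch of that admission check; the block-level bookkeeping is simplified and the names are illustrative, not framework APIs.

# Sketch of the KV-cache admission check under the two allocation strategies.
def blocks_needed(num_tokens, block_size):
    return -(-num_tokens // block_size)  # ceiling division

def can_schedule(request, free_kv_blocks, block_size, policy):
    if policy == "preallocation":  # e.g. GUARANTEED_NO_EVICT
        # Reserve the worst case upfront: prompt plus maximum output tokens.
        needed = blocks_needed(
            len(request.prompt_tokens) + request.max_new_tokens, block_size)
    else:  # on-demand, e.g. MAX_UTILIZATION or vLLM
        # Only the prompt needs KV cache now; decode blocks are allocated
        # later as tokens are generated, at the risk of preemption.
        needed = blocks_needed(len(request.prompt_tokens), block_size)
    return needed <= free_kv_blocks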
Preemption
Preemption can occur when the context of requests in the active batch grows as more tokens are generated, requiring additional on-demand KV cache allocation while there is not enough free KV cache memory. In such cases, some requests in the batch must be preempted, and their KV caches must be evicted to free up memory and prevent deadlock. Eviction can be done in one of two ways: swapping the KV cache to host memory or dropping it entirely. Swapping transfers the KV cache to host memory when a request is preempted and loads it back when the request is resumed. Dropping, on the other hand, discards the KV cache and keeps only the context along with the generated tokens; when the request is resumed, the KV cache is recomputed during the prefill stage.
Dropping is often preferred over swapping, as host memory read and write operations introduce significant overhead. In contrast, dropping only requires a single prefill iteration where previously generated tokens are concatenated with the original input tokens, making it a more efficient option in most cases.
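A simplified sketch of the two eviction options might look like this (helpers such as copy_to_host and free_gpu_blocks are illustrative, not actual framework APIs):

# Sketch of the two ways to evict a preempted request's KV cache.
def preempt(request, mode):
    if mode == "swap":
        # Copy the KV cache to host memory; it is loaded back on resume.
        request.cpu_kv_cache = copy_to_host(request.gpu_kv_cache)
    elif mode == "recompute":
        # Drop the KV cache entirely and keep only the tokens, so the cache
        # can be rebuilt in a single prefill pass when the request resumes.
        request.pending_prefill_tokens = request.prompt_tokens + request.output_tokens
    free_gpu_blocks(request.gpu_kv_cache)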
How TensorRT-LLM Schedules Requests
Since TensorRT-LLM contains proprietary code, its exact scheduling policy cannot be directly determined from the source. However, based on careful observation, it appears that TensorRT-LLM adopts the continuous batching approach with few, if any, modifications. While the source code is not publicly available, we can infer this by analyzing the request patterns over iterations. Figure 6 illustrates the number of requests being processed per iteration under the default scheduling policy, GUARANTEED_NO_EVICT, for a sample workload.
Note that each request takes 512 iterations, as the output length is set to 512. We can observe that new requests are added as soon as previous ones complete. This behavior demonstrates the core principle of continuous batching, specifically iteration-level scheduling: new requests are introduced immediately after others finish.
When the policy is changed to MAX_UTILIZATION, the behavior slightly changes.
We can now observe preemptions: preempted requests are removed from the schedule and resumed in later iterations. Despite the preemptions, the continuous batching scheme remains evident in Figure 7.
How vLLM Schedules Requests
Unlike TensorRT-LLM, vLLM's scheduler is fully transparent, as its codebase is open-source. vLLM also adopts iteration-level scheduling, the core component of continuous batching. However, it introduces two distinct changes: no mixed batching and prefill prioritization.
Currently, vLLM does not use mixed batching by default. This means that prefill requests are only batched with other prefill requests, and decode requests are only batched with other decode requests. This design simplifies the computational path, as each batch processes the same stage. Without mixed batching, one additional strategy must be employed by the scheduler: prefill requests must be prioritized. Let’s consider a scenario where a request in the current batch finishes, and there is a new request waiting in the request pool to be batched. Without mixed batching, the new request cannot be added to the current batch, because it first needs to go through the prefill phase before entering the decode phase. As a result, it cannot be batched with the decode requests currently being processed.
This breaks the concept of continuous batching. To resolve this, the current batch of decode requests must be postponed, and the prefill request must be processed first to maintain the flow of continuous batching. Therefore, prefill requests must be prioritized to ensure there are enough requests in the decode phase during the subsequent decode iterations.
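In pseudocode, the resulting scheduling decision can be sketched as follows. This is an illustration of the idea, not the actual vLLM scheduler code.

# Rough sketch of prefill prioritization without mixed batching.
def pick_next_batch(waiting_prefill, running_decode, max_batch_size):
    if waiting_prefill:
        # Run prefills first so newly admitted requests can join the
        # decode batch in the subsequent iterations.
        return "prefill", waiting_prefill[:max_batch_size]
    # Otherwise, run one decode iteration for all running requests.
    return "decode", running_decode[:max_batch_size]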
Note that vLLM has limited support for mixed batching when chunked prefill is enabled. This can allow scheduling behavior similar to TensorRT-LLM's MAX_UTILIZATION policy. However, mixed batching is not available unless chunked prefill is enabled.
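For reference, chunked prefill can be enabled through the engine arguments, roughly as shown below. The argument names reflect vLLM around v0.6 and may change across versions.

# Hedged example: enabling chunked prefill so prefill chunks can be
# batched together with decode requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    enable_chunked_prefill=True,      # allow mixed prefill/decode batches
    max_num_batched_tokens=16384,     # per-iteration token budget
    max_num_seqs=256,                 # maximum batch size
)
outputs = llm.generate(["An example prompt"], SamplingParams(max_tokens=512))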
Experiment Setup
We designed experiments to compare the different scheduling policies of vLLM and TensorRT-LLM. We mainly compared TensorRT-LLM with the GUARANTEED_NO_EVICT policy, TensorRT-LLM with the MAX_UTILIZATION policy, and vLLM.
Framework Version, Model and Hardware
- vLLM: v0.6.2
- TensorRT-LLM: 0.14.0.dev24092401 (C++ API)
- Model: Llama-3-8B (BF16)
- H/W: NVIDIA A100-SXM 80G GPU, Intel Xeon(R) 2.20GHz (12 cores), 128 GB RAM
Benchmark Dataset
For the benchmark dataset, we used both prefill-heavy and decode-heavy datasets with varying sequence lengths to assess performance under different conditions.
- NK prefill-heavy: NK input tokens and 1K output tokens
- NK decode-heavy: 1K input tokens and NK output tokens
Maximum sequence length was set to (N+1)K for all experiments. Max batch size and max number of tokens were set to 256 and 16384, respectively. Request rate was set to infinite.
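As a rough illustration of this configuration on the vLLM side, the offline API could be set up as below. The prompt construction is schematic and the exact arguments may differ across versions.

# Hedged sketch of a fixed-length, decode-heavy run (the 4K case) with vLLM.
from vllm import LLM, SamplingParams

N = 4  # 1K input tokens, 4K output tokens
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",
    max_num_seqs=256,                 # max batch size
    max_num_batched_tokens=16384,     # max number of tokens
    max_model_len=(N + 1) * 1024,     # maximum sequence length of (N+1)K
)
params = SamplingParams(max_tokens=N * 1024, ignore_eos=True)  # force fixed output length

def make_prompt(n_words):
    # Crude way to build a long prompt; real benchmarks control exact token counts.
    return " ".join(["hello"] * n_words)

prompts = [make_prompt(1000) for _ in range(256)]  # roughly 1K input tokens each
outputs = llm.generate(prompts, params)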
Results
To analyze the scheduler's behavior, we first evaluate the average running batch size before looking at end-to-end performance. Because the scheduler determines how many requests are processed per iteration, the running batch size reveals key differences between vLLM and TensorRT-LLM.
Despite some differences like mixed batching or prefill prioritization, both vLLM and TensorRT-LLM with the MAX_UTILIZATION policy exhibit similar trends in how average batch size decreases with increasing sequence length (as seen in Figure 8). This is due to memory constraints on the KV cache, leading to preemption as sequence length grows.
When comparing GUARANTEED_NO_EVICT with the other two cases, the average batch size for the former is consistently smaller. This is because preallocation of the KV cache in the GUARANTEED_NO_EVICT policy prevents aggressive batch size scaling. Yet, this strategy ensures no preemption, which may lead to better throughput in the end.
Typically, larger batch sizes lead to higher throughput but worse time per output token (TPOT). However, Figures 9 and 10 show that this isn't always the case.
In the prefill-heavy scenario, where the output length is limited to 1K, TensorRT-LLM with GUARANTEED_NO_EVICT achieves the highest throughput. With the output constrained to 1K, the performance cost of preemption outweighs the gains from a larger batch size, as the total number of decode steps is too small for the increased batch size to make a significant difference.
In the decode-heavy scenario, however, the benefits of increased batch size become more apparent as the number of decode steps gets larger. In this case, TensorRT-LLM with MAX_UTILIZATION delivers higher throughput compared to TensorRT-LLM with GUARANTEED_NO_EVICT.
The TPOT data also calls into question the common notion that a longer sequence length or a larger batch size always leads to worse TPOT. When comparing TensorRT-LLM with MAX_UTILIZATION and GUARANTEED_NO_EVICT on prefill-heavy datasets, this assumption seems valid: MAX_UTILIZATION shows higher TPOT than GUARANTEED_NO_EVICT. However, when looking at different input lengths within MAX_UTILIZATION, the behavior is less consistent. This is because the impact of longer sequence lengths can outweigh the effect of a reduced batch size beyond a certain input sequence length.
We can observe an interesting trend in decode-heavy datasets, especially for TensorRT-LLM with GUARANTEED_NO_EVICT. In this case, TPOT decreases as output length increases, despite the longer context. This happens because GUARANTEED_NO_EVICT reduces the batch size even in the early decoding steps. In contrast, TensorRT-LLM with MAX_UTILIZATION and vLLM show a more straightforward trend. They miss the opportunity for reduced TPOT in the early decoding steps, when the sequence length is still short, due to the large batch sizes scheduled by their aggressive memory management policies.
Final Thoughts
In this post, we explored scheduling policies in vLLM and TensorRT-LLM, analyzing their impact on performance using fixed-length benchmarks. Our findings highlight that the effect of scheduling can vary depending on the scenario.
However, we were unable to fully capture the differences in memory allocation, since we used a fixed number of output tokens. A smarter scheduler shines when the sequence lengths of requests vary significantly; in such scenarios, the ability to dynamically adjust batching and resource allocation becomes crucial. Benchmark results would likely differ with a dynamic-length dataset. For example, simply changing the length distribution from uniform to normal in a fixed-length benchmark already shows different trends.
This highlights the importance of dynamic-length benchmarks. To truly assess the value and impact of different schedulers, we need to simulate more realistic scenarios. In our upcoming post, we’ll dive deeper into this topic and explore how results change under dynamic-length conditions.