[vLLM vs TensorRT-LLM] #11. Speculative Decoding
This article provides a comparative analysis of speculative decoding.
Dec 09, 2024
Contents
- Introduction
- What is Speculative Decoding?
- Key Considerations of Speculative Decoding
  1. Acceptance Ratio
  2. Verification Batch Size
  3. Memory Allocation
- Experiment Setup
  - Benchmark Dataset
  - Model and Hardware Specification
  - Framework Version
- Results
  - Impact of the Number of Draft Tokens
  - Impact of Input Length
  - Impact of Draft Model Selection and Acceptance Ratio
  - Limiting Batch Size for Speculative Decoding
- Final Thoughts
Introduction
Recently, many AI hardware startups have been launching API services powered by their proprietary hardware. To showcase the superiority of their technologies, these companies are offering services at lower costs or publishing reports highlighting faster token generation, measured as TPOT (Time Per Output Token). In this fierce competition for speed and cost-efficiency, one particular technique has gained significant traction, delivering remarkable improvements in latency: speculative decoding. For instance, Groq reported achieving over a 6x speed-up in serving the Llama-3.1-70B model using speculative decoding, while SambaNova noted more than a 2x improvement in serving the Llama-3.1-405B model compared to other API services. This article explores speculative decoding, its implementation in the vLLM and TensorRT-LLM frameworks, and experimental results demonstrating its strengths and limitations.
What is Speculative Decoding?
Speculative decoding is an advanced inference optimization technique designed to accelerate the text generation process of LLMs without compromising the output quality. At its core, this approach leverages a smaller, faster model called the draft model alongside a larger, more accurate model called the target model. Among several speculative decoding strategies, such as self-speculating, lookahead, and Medusa, this article focuses on the classical Draft-Target method, which forms the foundation of many speculative decoding techniques and is implemented in both vLLM and TensorRT-LLM.
The Draft-Target method operates in two main stages: the proposal stage and the verification stage (see Figure 1). In the proposal stage, the draft model takes the input context and generates a set of token candidates (draft tokens). The draft model predicts several tokens in a single stage through an auto-regressive process, effectively "speculating" on how the sequence should progress. Since the draft model is often much smaller and faster than the target model, this speculation process is much more efficient than running the target model.
Once the draft tokens are generated, the target model steps in to evaluate their validity (the verification stage). The larger target model scores the proposed draft tokens in parallel by calculating their probabilities given the same context. If the target model assigns high probabilities to the draft tokens (indicating that they align with its more accurate understanding), those tokens are accepted. However, if even one token in the sequence is deemed unlikely, that token and all subsequent tokens are rejected. If a draft token is rejected during verification, it implies that the target model has determined a more likely token based on its deeper understanding. In such cases, the rejected draft token is replaced with the token the target model considers more likely, ensuring the sequence remains valid. This process inherently allows the target model to decode at least one token during each verification stage, which is referred to as the "bonus token." Additionally, when all draft tokens are accepted, the last token generated by the target model during verification can be appended as the next token, further justifying the term "bonus token."
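To make the verification rule concrete, here is a minimal Python sketch of the accept/reject logic described above. It follows the standard rejection-sampling formulation from the speculative decoding literature; the per-position probability dictionaries are illustrative stand-ins for real model outputs, not vLLM or TensorRT-LLM APIs.

```python
import random

def verify_draft_tokens(draft_tokens, p_draft, p_target):
    """Sketch of the accept/reject rule in classic Draft-Target speculative decoding.

    draft_tokens : k token ids proposed by the draft model
    p_draft      : k dicts mapping token id -> draft-model probability
    p_target     : k + 1 dicts mapping token id -> target-model probability
                   (the extra entry covers the position after the last draft token)
    Returns the accepted tokens; at least one target-decoded token is always kept.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if random.random() <= min(1.0, p_target[i][tok] / p_draft[i][tok]):
            accepted.append(tok)
            continue
        # Rejection: replace the token by sampling from the residual distribution
        # max(0, p_target - p_draft), then discard all later draft tokens.
        residual = {t: max(0.0, p_target[i][t] - p_draft[i].get(t, 0.0))
                    for t in p_target[i]}
        tokens, weights = zip(*residual.items())
        accepted.append(random.choices(tokens, weights=weights)[0])
        return accepted
    # All drafts accepted: the target's prediction for the next position is the bonus token.
    tokens, weights = zip(*p_target[len(draft_tokens)].items())
    accepted.append(random.choices(tokens, weights=weights)[0])
    return accepted
```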
By offloading most of the token generation to the draft model, while leveraging the target model's superior accuracy for validation, speculative decoding reduces the computational burden and improves throughput, leading to faster text generation without sacrificing accuracy. This makes speculative decoding particularly useful in scenarios where token generation speed is critical, such as real-time API services.
Key Considerations of Speculative Decoding
To utilize speculative decoding effectively, several key factors must be carefully managed. Among the controllable variables, the most impactful are the number of draft tokens and the choice of draft model. These two variables directly influence key metrics like the acceptance ratio, verification batch size, and memory utilization, which collectively determine the performance of speculative decoding.
- Number of Draft Tokens
The number of draft tokens is a user-defined parameter that determines how many tokens the draft model generates in a single stage, or equivalently, how many tokens are submitted to the target model for verification. This variable has a cascading effect on speculative decoding's behavior, influencing both the acceptance ratio and verification batch size.
- Draft Model Selection
The choice of draft model significantly impacts speculative decoding's performance by determining the speed and quality of token proposals. Draft models are smaller and faster than the target model, making them computationally efficient for speculative decoding. However, their size and capability must be carefully considered. Smaller draft models typically generate tokens quickly but may produce lower-quality outputs.
1. Acceptance Ratio
The acceptance ratio is a measure of how often the target model approves the tokens proposed by the draft model. It is a critical factor in speculative decoding, as a higher acceptance ratio ensures minimal redundant computations, leading to faster generation and reduced latency. Conversely, a low acceptance ratio necessitates frequent token regeneration, diminishing the overall benefits of speculative decoding. Several factors influence the acceptance ratio, with the most important being the number of draft tokens and the selection of the draft model.
When increasing the number of draft tokens, it becomes possible to process more tokens in a single step, potentially boosting throughput. However, as the number of draft tokens grows, the acceptance ratio often decreases, leading to redundant computations and negating performance gains (see Figure 2). Finding the optimal number of draft tokens often requires empirical experimentation, as it heavily depends on the specific serving environment.
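To see why adding draft tokens yields diminishing returns, consider a simplified model (an illustrative assumption, not a measurement from these experiments) in which each draft token is accepted independently with probability alpha. The expected number of tokens produced per verification step then saturates quickly as the number of draft tokens k grows, while the verification cost keeps increasing with k:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per verification step, assuming each of the k draft tokens
    is accepted independently with probability alpha. The leading 1 accounts for
    the guaranteed target-decoded (replacement or bonus) token."""
    return sum(alpha ** i for i in range(k + 1))

# With a 50% per-token acceptance rate, going from 4 to 8 draft tokens barely helps.
for k in (2, 4, 8):
    print(k, round(expected_tokens_per_step(0.5, k), 2))  # 1.75, 1.94, 2.0
```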
Choosing the right draft model is also important. The draft model should be significantly smaller and faster than the target model to deliver a meaningful performance boost. However, if the draft model lacks sufficient capacity, it may generate low-quality tokens that are frequently rejected by the target model. Therefore, selecting a draft model that balances speed and token quality is critical to maximizing the benefits of speculative decoding.
2. Verification Batch Size
The verification of tokens proposed by the draft model is one of the most resource-intensive steps in speculative decoding. During this process, the target model evaluates the probabilities of the draft tokens within the given input context. To maximize efficiency, the verification of multiple draft tokens is processed in parallel by increasing the number of queries in the batch dimension by (number of draft tokens + 1). This parallelization enables speculative decoding to handle multiple token proposals simultaneously, reducing end-to-end latency. For example, as illustrated in Figure 3, each speculative token adds an additional query to the target model's inference workload. As the number of draft tokens increases, the effective batch size grows proportionally, leading to a more rapid rise in computational overhead per iteration compared to standard decoding. This means that speculative decoding's performance gains diminish as batch size grows, especially when operating at large scales. Therefore, careful tuning of the number of draft tokens and batch size is necessary to prevent the diminishing returns associated with oversized verification workloads.
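As a rough back-of-the-envelope illustration of this growth (assuming, for simplicity, that every in-flight sequence is verified in the same step), the effective verification batch size can be estimated as follows:

```python
def verification_batch_size(num_sequences: int, num_draft_tokens: int) -> int:
    """Queries the target model scores per verification step: each sequence
    contributes its draft tokens plus one extra position for the bonus token."""
    return num_sequences * (num_draft_tokens + 1)

# At a concurrency of 32 with 4 draft tokens, the target model scores 160 queries
# per step, versus 32 for standard decoding.
print(verification_batch_size(32, 4))  # 160
```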
3. Memory Allocation
Memory allocation is another key factor influenced by the selection of the draft model. Since the draft and target models work collaboratively in speculative decoding, GPU memory must be carefully allocated to both models, including their respective Key-Value (KV) caches.
Figure 4 illustrates how GPU memory is allocated when the target model is configured with Tensor Parallelism (TP) of 4 and the draft model with TP=1. In standard (vanilla) decoding, only the target model is deployed, with its weights and buffers evenly distributed across the four GPUs. The remaining GPU memory on each device is used entirely for the target model's KV cache, maximizing memory utilization.
In speculative decoding, however, additional memory must be allocated for the draft model. Because the draft model is smaller and configured to run with TP=1, it is assigned to only one of the four GPUs, GPU 0 in this example. On GPU 0, memory is allocated for both the draft model's weights and buffers as well as its KV cache. Consequently, the memory available for the target model's KV cache on GPU 0 is reduced. Since Tensor Parallelism requires equal workloads across all participating GPUs, this reduced KV cache allocation on GPU 0 becomes the limiting factor for the other GPUs. GPUs 1, 2, and 3 are forced to match the reduced KV cache size of GPU 0, leaving a significant portion of memory on these GPUs unused. This imbalance in memory utilization highlights the importance of carefully considering the draft model's size and memory requirements when configuring speculative decoding.
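The arithmetic below sketches this imbalance with rough, assumed numbers (illustrative only; they are not measurements from the experiments in this article):

```python
# All figures in GB; illustrative assumptions, not measured values.
GPU_MEM = 80                 # A100 80GB
TARGET_WEIGHTS = 140         # ~70B parameters in BF16
DRAFT_WEIGHTS = 1            # ~0.5B parameters in BF16
DRAFT_KV = 2                 # KV cache reserved for the draft model
TP = 4

target_shard = TARGET_WEIGHTS / TP              # target weights per GPU: 35

# Vanilla decoding: everything left over goes to the target model's KV cache.
kv_vanilla = GPU_MEM - target_shard             # 45 per GPU

# Speculative decoding with the draft model placed on GPU 0 only (draft TP=1).
kv_gpu0 = GPU_MEM - target_shard - DRAFT_WEIGHTS - DRAFT_KV   # 42 on GPU 0

# Tensor parallelism requires equal KV-cache allocation across GPUs, so GPUs 1-3
# are capped at GPU 0's budget, leaving (45 - 42) GB unused on each of them.
kv_per_gpu = min(kv_vanilla, kv_gpu0)
print(kv_per_gpu, kv_vanilla - kv_per_gpu)      # 42.0 3.0
```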
Experiment Setup
Benchmark Dataset
To evaluate the efficiency and performance of speculative decoding, we conducted experiments using a carefully curated dataset rather than a dataset of random tokens. If both the draft and target models take random tokens as input, they will also generate meaningless tokens, making the draft-and-verify process itself meaningless. Therefore, we used the Dynamic-Sonnet dataset for all experiments. For the output length, we used 128 tokens as the default to make a fair comparison between different configurations. Detailed explanations and examples of the dataset can be found in our previous article and on Hugging Face.
Model and Hardware Specification
Speculative decoding requires both the target and draft models to share the same tokenizer. There are special techniques that enable the use of models with different tokenizers, but for now both vLLM and TensorRT-LLM require the draft and target models to use the same tokenizer. For our experiments, we chose Llama-3.1-70B-Instruct, a state-of-the-art model, as our target model. For draft models, we selected Qwama-0.5B-Instruct and Llama-3.1-8B-Instruct. Qwama is a well-known adaptation of the Qwen2-0.5B-Instruct model to the Llama-3 tokenizer.
- Draft Models: Qwama-0.5B-Instruct (BF16), Llama-3.1-8B-Instruct (BF16)
- Target Model: Llama-3.1-70B-Instruct (BF16) with TP 4
- Hardware: Intel Xeon(R) Platinum 8273CL @ 2.20GHz, 4x NVIDIA A100-SXM 80GB GPUs, 680GB RAM
Framework Version
- vLLM: v0.6.3
- TensorRT-LLM: v0.14.0 / Triton Server: v2.51.0
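For reference, the sketch below shows roughly how such a configuration can be expressed with vLLM's offline Python API. The argument names follow vLLM v0.6.x and should be verified against the installed version, and the model identifiers are placeholders; this is an illustrative setup, not the exact benchmark script used for the results below.

```python
from vllm import LLM, SamplingParams

# Target model on 4 GPUs (TP=4) with a small draft model proposing 4 tokens per step.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",       # target model
    tensor_parallel_size=4,
    speculative_model="path/to/Qwama-0.5B-Instruct", # draft model (placeholder path)
    num_speculative_tokens=4,                        # number of draft tokens
    speculative_draft_tensor_parallel_size=1,        # draft model runs with TP=1
)

outputs = llm.generate(
    ["Compose a short sonnet about GPUs."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```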
Results
Impact of the Number of Draft Tokens
- Draft Model: Qwama-0.5B-Instruct
The first set of experiments examined how the number of draft tokens affects throughput and TPOT. To manage the active batch size, we varied the maximum concurrency (maximum number of requests processed concurrently).
Figure 5 shows that increasing the number of draft tokens from 2 to 4 improves throughput across configurations in vLLM. This effect is particularly noticeable at lower maximum concurrency, where the effective batch size remains small. However, when the number of draft tokens exceeds 5, throughput begins to degrade due to the inverse relationship between the acceptance ratio and the number of draft tokens.
At higher maximum concurrency levels, the performance gap between configurations with different numbers of draft tokens narrows. Notably, speculative decoding underperforms compared to standard decoding at high concurrency. This result can be attributed to the increased batch size exacerbating the computational overhead of speculative decoding, particularly during the verification stage. The findings suggest that beyond a certain concurrency threshold, the overhead outweighs the benefits of speculative decoding, making standard decoding a more efficient choice in such scenarios.
In the case of TensorRT-LLM, as shown in Figure 6, speculative decoding achieves higher throughput than vanilla decoding only when the maximum concurrency is set to 1. As the maximum concurrency increases to 2, 4, and 8, the throughput of vanilla decoding continues to rise, while the throughput of speculative decoding remains constant. This is because the current version of TensorRT-LLM supports speculative decoding only for a batch size of 1, making it impractical for serving scenarios with larger batch sizes. For this reason, the following sections focus solely on vLLM, which supports more flexible batch sizes and concurrency levels.
Impact of Input Length
- Draft Model: Qwama-0.5B-Instruct
To evaluate how input length affects speculative decoding performance, we conducted experiments using two variants of the Dynamic-Sonnet dataset with 1K and 2K input lengths. The number of draft tokens was kept constant at 4.
Figure 7 shows that the concurrency threshold at which speculative decoding remained beneficial decreased from 32 to 16 as the input length increased from 1K to 2K. This shift is primarily due to the increased computational cost associated with processing longer contexts. Speculative decoding is more sensitive to these demands than standard decoding because of its heavier verification process. Additionally, the acceptance ratio decreased from 54.0% to 50.9% due to the limited capacity of the draft model for longer contexts, further diminishing the performance of speculative decoding.
Impact of Draft Model Selection and Acceptance Ratio
- Draft Models: Qwama-0.5B-Instruct & Llama-3.1-8B-Instruct
To explore how the size of the draft model affects speculative decoding performance, we compared Qwama-0.5B-Instruct with Llama-3.1-8B-Instruct as draft models. We also conducted an experiment in which all draft tokens were forcibly accepted (a 100% acceptance ratio) to assess the ideal scenario.
Figure 8 shows the effect of draft model size and acceptance ratio on throughput and TPOT. When the draft model is switched from the 0.5B model to the 8B model, the overall performance of speculative decoding decreases.
Under the default settings, the acceptance ratios with the 0.5B draft model and the 8B draft model were 53.5% and 76.5%, respectively. While the larger 8B draft model achieved a higher acceptance ratio due to its greater capacity and accuracy, this advantage was offset by its increased computational cost. The higher inference time for the 8B model reduced the overall throughput and TPOT gains from speculative decoding.
What if we could achieve a 100% acceptance ratio with the draft model? Then we would obtain the ideal performance improvement from speculative decoding. To investigate this upper bound, we modified the verification process to accept all draft tokens and measured the performance. Figure 9 shows that the ideal case of the 8B draft model achieved a speed-up competitive with that of the 0.5B draft model in practice. In addition, speculative decoding with the 0.5B draft model achieved performance comparable to vanilla decoding even at a maximum concurrency of 64. Since the acceptance ratio can vary considerably across draft models depending on the dataset, practitioners need to find the sweet spot by weighing the trade-off between computational cost and acceptance ratio to fully leverage speculative decoding.
Limiting Batch Size for Speculative Decoding
- Draft Model: Llama-3.1-8B-Instruct
As shown in the previous experiments, speculative decoding is most effective with smaller batch sizes. When the batch size exceeds a certain threshold, it is better not to use the draft model. This hybrid approach can be implemented by manually constraining the batch size for speculative decoding to improve serving performance. In vLLM, this can be controlled by the speculative_disable_by_batch_size option, which disables speculative decoding for new incoming requests if the number of enqueued requests exceeds this value; those requests are then served with vanilla decoding using the target model. Let's consider a scenario where speculative_disable_by_batch_size is set to 64. With this configuration, speculative decoding is enabled while the batch size is below 64, utilizing the draft model to accelerate token generation. However, as requests accumulate and the batch size exceeds 64, subsequent requests are processed using vanilla decoding without the draft model. This approach allows speculative decoding to be used dynamically, depending on the current batch size.
While this article has focused on experiments using concurrency as the primary variable for simplicity in interpreting results, it's important to note that in service scenarios, the request rate becomes the key factor. For instance, even at a request rate as low as 4 per second, the accumulation of requests can quickly push the batch size beyond 32 or 64, where speculative decoding's benefits begin to diminish. This highlights the importance of configuring speculative decoding parameters carefully to align with expected request rates and workload patterns.
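As a concrete example of the hybrid setup described above, a configuration along these lines could be used (again using vLLM v0.6.x argument names that should be checked against the installed version; the model identifiers are placeholders, and this is not the exact serving configuration used in our experiments):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # draft model
    num_speculative_tokens=4,
    # Newly scheduled requests fall back to vanilla decoding once more than 64
    # requests are enqueued; below that threshold the draft model is used.
    speculative_disable_by_batch_size=64,
)
```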
Final Thoughts
At the beginning, it was mentioned that AI hardware startups are leveraging speculative decoding to maximize API service speeds, often demonstrating dramatic improvements in TPOT. However, as our experiments have shown, speculative decoding is not always beneficial. For large-scale serving scenarios, achieving optimal performance requires balancing throughput and TPOT, a trade-off that becomes increasingly evident under realistic serving conditions.
Our findings reveal that speculative decoding excels at reducing TPOT, but only in scenarios with very small batch sizes. This inherently limits its ability to deliver significant throughput gains, particularly for scenarios with high request rates. Thus, while speculative decoding can be a valuable tool, it's important to carefully evaluate its effectiveness for a given service scenario. On top of that, careful consideration must be given to key parameters such as the choice of draft model and the optimal number of draft tokens. Each decision has a cascading impact on performance, influencing metrics like acceptance ratio, verification batch size, and memory utilization. Note that while TensorRT-LLM currently lacks full support for speculative decoding at batch sizes greater than one, future updates may make it a more viable option worth revisiting.
It’s also important to acknowledge the limitations of the experiments in this article. First, the results are based solely on the classic Draft-Target method. Leveraging advanced speculative decoding techniques, such as Medusa, may yield even greater performance improvements. Second, all experiments were done on the same Dynamic-Sonnet dataset. Since the effectiveness of speculative decoding is highly dependent on the acceptance rates, its performance can vary significantly across different tasks and models. Testing on diverse datasets and models may reveal substantial variations in overall performance.
Finally, identifying the optimal serving configuration requires extensive experimentation, as we have demonstrated in this article. To facilitate this process, we recommend trying our Fits on Chips service, which is freely available and designed to help practitioners explore and refine their serving setups efficiently.