[vLLM vs TensorRT-LLM] #3. Understanding Sampling Methods and Their Performance Impact
This article provides a comparative analysis of the vLLM and TensorRT-LLM frameworks under various sampling methods.
Oct 18, 2024
Large Language Models (LLMs) generate text by predicting the next token from the probability distribution over possible tokens, given the context provided. Greedy sampling, the simplest sampling method, selects the token with the highest probability at each step. However, this deterministic approach often produces repetitive and less creative sentences. To address this, various sampling methods such as Top-K, Top-P, and repetition penalty have been proposed. Most LLM serving frameworks (e.g., vLLM and TensorRT-LLM) support these sampling techniques, allowing users to adjust the balance between creativity and coherence. However, these methods increase computational cost, which affects serving performance metrics such as token throughput, Time-to-First-Token (TTFT), and Time-per-Output-Token (TPOT).
In this article, we will start by exploring key sampling techniques: Top-K, Top-P, and repetition penalty. Then, we will assess the performance overhead of these techniques under different configurations on both the TensorRT-LLM and vLLM frameworks.
Understanding Sampling Methods
Greedy Sampling
Greedy sampling simply selects the token with the highest probability at each iteration (Figure 1). While it provides predictable outputs, making it very useful for development and debugging, it often leads to repetitive (a failure mode known as “degeneration”) and less diverse outputs. Consider the following example generated by Llama-3-8B, which overuses the word "conscious."
Input: "Hey, are you conscious?" # Output generated by Llama-3-8B using greedy sampling Output: "I mean, are you really conscious? I mean, are you really conscious of your consciousness? I mean, are you really conscious of your consciousness of your consciousness"
Top-K Sampling
To improve output diversity, sampling-based methods like Top-K sampling were introduced. Top-K sampling gives tokens other than the single most probable one a chance of being selected. As shown in Figure 2, tokens are ranked by their probabilities as calculated by the LLM, but only the top K tokens are kept for consideration. The probabilities of these K tokens are then renormalized so they sum to 1, and the next token is randomly chosen based on these normalized probabilities. This approach introduces controlled randomness, helping to diversify the generated outputs while still preventing the selection of highly unlikely tokens.
Input: "Hey, are you conscious?" # Output generated by Llama-3-8B using greedy sampling Output: "I mean, are you really conscious? I mean, are you really conscious of your consciousness? I mean, are you really conscious of your consciousness of your consciousness" # Outputs generated by Llama-3-8B using Top-K (K=50) with two different seeds Output 1: "Are you aware of the fact that there are only 8 hours left in the year 2018? The year 2018 is going to end in a" Output 2: "You have to be conscious to be able to do the things you need to do in life. Consciousness is a state of being that enables you to be"
The examples above compare the output sentence generated by greedy sampling and Top-K sampling (K = 50). Top-K sampling reduces repetitions and produces more diverse sentences compared to the deterministic decoding.
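A minimal PyTorch sketch of one Top-K sampling step is shown below; it is illustrative only and does not reflect the optimized kernels either framework actually uses.

import torch

def top_k_step(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    # logits: [batch_size, vocab_size] for the current decoding step.
    # 1. Keep only the K highest-probability tokens.
    topk_logits, topk_indices = torch.topk(logits, k, dim=-1)
    # 2. Renormalize over the K candidates so their probabilities sum to 1.
    probs = torch.softmax(topk_logits, dim=-1)
    # 3. Sample one of the K candidates and map back to vocabulary ids.
    sampled = torch.multinomial(probs, num_samples=1)
    return topk_indices.gather(-1, sampled).squeeze(-1)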
Top-P (Nucleus) Sampling
Top-P sampling, or nucleus sampling, follows a similar process to Top-K sampling but differs in how the candidate token set is selected. Instead of limiting the set to a fixed number of tokens (K), Top-P sampling dynamically selects the set of tokens whose cumulative probability exceeds a predefined threshold P (e.g., 0.9).
This dynamic method offers greater flexibility, as the number of candidate tokens can vary depending on the context of the generation. By adjusting the threshold P, the model can control how many tokens are considered at each step, allowing for a balance between diversity and coherence in the generated output.
Note that Top-K and Top-P sampling can be used together as shown in the following example. When used together, the token set is first limited to K candidates, and then further narrowed down by including only those tokens whose cumulative probability meets the threshold P.
Input: "Hey, are you conscious?" # Output generated by Llama-3-8B using greedy sampling Output: "I mean, are you really conscious? I mean, are you really conscious of your consciousness? I mean, are you really conscious of your consciousness of your consciousness" # Output generated by Llama-3-8B using Top-P (P=0.9) Output: "That's a good question to ask yourself, because it is a question that we all need to ask ourselves from time to time. Consciousness is the ability to" # Output generated by Llama-3-8B using Top-K (K=50) and Top-P (P=0.9) Output: "It’s time to get up.\nI’m sure you’re still sleeping. You’ve been sleeping for 30 years. You’ve been dreaming of a better life"
Both Top-K sampling and Top-P sampling heavily rely on the probability distribution of the tokens, and Temperature (T) is another useful parameter that lets users reshape that distribution. As illustrated in Figure 4, lowering the temperature sharpens the probability distribution, making it more likely that the model will select the highest-probability tokens. This reduces the overall randomness of the output and results in more coherent, predictable sentences (Figure 4b). On the other hand, increasing the temperature flattens the probability distribution, making the model more likely to choose lower-probability tokens. This can generate more creative and diverse text, but it also increases the risk of less coherent results (Figure 4c). A proper temperature setting allows users to fine-tune the balance between coherence and diversity, depending on the application.
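The effect of temperature is easy to see on a toy distribution; the snippet below simply divides the logits by T before the softmax.

import torch

logits = torch.tensor([[4.0, 2.0, 1.0]])
for t in (0.5, 1.0, 4.0):
    # Lower T sharpens the distribution, higher T flattens it.
    print(f"T={t}:", torch.softmax(logits / t, dim=-1))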
Repetition Penalty
Repetition penalty is another important technique used to discourage the model from generating repetitive outputs. This method penalizes tokens that have already been selected in the previous steps, thereby lowering their probability and reducing the likelihood of them being chosen again. For example, with a repetition penalty of 1.1, the Llama-3-8B model produces the following output.
Input: "Hey, are you conscious?" # Output generated by Llama-3-8B using greedy sampling Output: "I mean, are you really conscious? I mean, are you really conscious of your consciousness? I mean, are you really conscious of your consciousness of your consciousness" # Output generated by Llama-3-8B using repetition penalty (1.1) Output : "I mean really conscious?\nI'm not talking about being aware of your surroundings or the people around you. I'm talking about being aware of yourself and what's" # Output generated by Llama-3-8B using Top-K (K=50), Top-P (P=0.9), and repetition penalty (1.1) Output: "I’m just asking. You know what I mean: Are you really awake and aware of your life?\nI have a friend who is always saying she’s “"
By tuning the repetition penalty parameter, users can mitigate degeneration while maintaining sentence coherence. However, a high penalty can make outputs less cohesive, as it may excessively penalize tokens that are necessary for proper sentence structure. There are two additional methods to control repetitive outputs: frequency penalty and presence penalty. Both apply their penalty by subtracting a value from the logits, whereas repetition penalty scales the logits (see Code 1). In addition, frequency penalty grows with the number of repetitions, whereas repetition penalty and presence penalty depend only on whether a token has appeared at all. Tuning these parameters together allows finer control over repetitive outputs across different service environments. In this post, we only use repetition penalty for simplicity, but the overhead of the other two is not significantly different.
# Code 1. Penalty application in vLLM's sampler (simplified excerpt).
# Only penalize tokens that have actually appeared in the prompt or the output so far.
repetition_penalties[~(prompt_mask | output_mask)] = 1.0
# Repetition penalty scales the logits: divide positive logits, multiply negative ones.
logits = torch.where(logits > 0,
                     logits / repetition_penalties,
                     logits * repetition_penalties)
# Frequency penalty subtracts in proportion to how often a token was generated;
# presence penalty subtracts a flat amount once a token has appeared at all.
logits -= frequency_penalties.unsqueeze_(dim=1) * output_bin_counts
logits -= presence_penalties.unsqueeze_(dim=1) * output_mask
It’s also important to note that these penalties can be applied in combination with Top-P and/or Top-K sampling. By combining the sampling methods, an LLM can generate more diverse and creative sentences.
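As a rough end-to-end illustration, outputs like the ones above can be produced with the Hugging Face transformers generate API, which exposes all of these knobs together; the model id and parameter values below simply mirror the examples and are assumptions, not recommendations.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint; requires gated access
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" assumes the accelerate package is installed.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Hey, are you conscious?", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,          # False would give greedy decoding
    top_k=50,
    top_p=0.9,
    temperature=1.0,
    repetition_penalty=1.1,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))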
Experimental Setup
Benchmark Dataset
To evaluate the impact of various sampling methods, we used a real dataset, ShareGPT, rather than the random fixed-length dataset used in previous articles. Specifically, we sampled 1,000 examples from the ShareGPT_Vicuna_unfiltered dataset. The average input length of the prompts was approximately 225 tokens, while the output length was fixed at 1,024 tokens to create a decode-heavy scenario. Focusing on decode-heavy tasks allows us to better evaluate how sampling affects performance, as these methods primarily influence the decoding phase rather than prefill-heavy scenarios (which have shorter outputs).
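A rough sketch of this dataset preparation is shown below; the file name and JSON layout are assumptions based on the commonly used ShareGPT_Vicuna_unfiltered release, not an exact reproduction of our script.

import json
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed tokenizer id

# Assumed local copy of the ShareGPT_Vicuna_unfiltered data.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

random.seed(0)
conversations = [d for d in data if d.get("conversations")]
prompts = [d["conversations"][0]["value"] for d in random.sample(conversations, 1000)]

lengths = [len(tokenizer(p).input_ids) for p in prompts]
print("average prompt length:", sum(lengths) / len(lengths))  # ~225 tokens for the sample in this post
# The output length is then fixed at 1,024 tokens per request to make the workload decode-heavy.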
Model and Hardware Specification
- Model: Llama-3-8B (BF16)
- H/W: Intel Xeon(R) 2.20GHz (12 cores), 1 NVIDIA A100-SXM 80G GPU, 128GB RAM
Framework Version
Since the TensorRT-LLM C++ API benchmark tool does not natively support sampling options, we adopted the measurement approach used in the vLLM benchmark. We served Llama-3-8B (BF16) with Triton Inference Server and measured throughput, TTFT, and TPOT on the sampled sentences using the benchmarks/benchmark_serving.py script from the vLLM source.
- vLLM (v0.6.2)
- TensorRT-LLM (0.12.0.dev24080600, Triton Inference Server Release 24.08)
Experimental Results
Results with Different Request Rate
We first evaluated the performance difference between using all the previously introduced sampling techniques together and the greedy method. We varied the request rate from 1 to 8 and analyzed the impact of sampling on throughput, TTFT and TPOT. For each sampling method, the parameters were set as follows.
Top-P (P=0.9), Top-K (K=50), Temperature (T=4), Repetition Penalty (1.1)
As shown in Figure 5, sampling methods caused a noticeable decrease in throughput when the request rate was larger than 2. At a request rate of 8, applying the sampling techniques dropped throughput by 15.4% in vLLM and 7.1% in TensorRT-LLM. Meanwhile, there was almost no performance degradation from sampling at relatively low request rates. This is because the workload becomes more compute-bound as the request rate increases: a higher request rate leads to a larger running batch size (introduced in the previous post) and higher operational density. In compute-bound conditions, the extra computational overhead of the sampling techniques becomes more pronounced, while it remains negligible at lower request rates (memory-bound conditions). TTFT and TPOT were similarly affected, with vLLM showing a more significant TTFT degradation. For TPOT, vLLM exhibited a larger drop (20.6%) compared to TensorRT-LLM (9.2%).
The difference in performance degradation between the two frameworks likely stems from differences in their implementation of the sampling process. While vLLM relies on Python-based sampling implementations (link), TensorRT-LLM uses custom CUDA kernels and low-level GPU optimizations to minimize the overhead (link). As vLLM continues to evolve, with ongoing effort to adopt dedicated CUDA kernels, this gap could narrow in the future.
Results with Different Batch Sizes
As mentioned earlier, sampling overhead becomes more pronounced in compute-bound scenarios. To further verify this, we conducted additional experiments varying the batch size. As discussed in the previous post, adjusting the max batch size parameter allows us to shift the workload toward a more compute-bound setting. We measured the three metrics at a request rate of 8, varying the max batch size parameter for each framework. As shown in Figure 6, the largest performance degradation occurred at a max batch size of 256 for both frameworks, which is the default value. In addition, the gap between the greedy and sampling cases narrowed as the max batch size decreased, since the workload became more memory-bound.
Ablation Study on Sampling Methods
To dive deeper into the effect of individual sampling techniques, we further conducted ablation studies on the three sampling methods. The experiments were run at a request rate of 8 and a max batch size of 256, where the sampling overhead was most noticeable.
When comparing each sampling method applied individually, we found that the overhead was largest for Top-K, followed by Top-P, and smallest for repetition penalty. Notably, the overhead for repetition penalty was minimal compared to Top-K and Top-P sampling, where sorting algorithms are required. In the case of TensorRT-LLM, the overhead from repetition penalty was almost negligible. Overall, sampling overhead was 2-3 times greater in vLLM than in TensorRT-LLM, with TPOT in vLLM degrading by over 20% when all sampling methods were used together.
Minor Code Patch
Unlike our previous posts, this time we had to account for sampling, which led us to use Triton Inference Server for TensorRT-LLM. However, when we first began benchmarking, we encountered significantly reduced throughput. When we benchmarked with the C++ API benchmark tool without sampling, throughput was around 3,127 tokens/s; when we used Triton Inference Server to enable sampling, it dropped to 313 tokens/s, nearly ten times slower. This drop in performance was far too drastic to be explained by sampling overhead alone.
While investigating the issue, we discovered a minor issue within the TensorRT-LLM and Triton Inference Server setup. The problem was related to the post-processing model deployed on Triton Inference Server. Specifically, the way the AutoTokenizer from the transformers library handles the vocabulary had a significant impact on performance. The current implementation of post-processing in Triton Inference Server with the TensorRT-LLM backend relies on calling len(self.tokenizer.vocab) to calculate the vocabulary size (link). This means the vocabulary dictionary object is created repeatedly, causing noticeable performance degradation.

We applied a simple patch to address this problem: we pre-calculated the vocabulary size during the initialization phase and reused it instead of repeatedly calling len(self.tokenizer.vocab) during post-processing (link). With this minor patch, we successfully eliminated the unexpected overhead and achieved reasonable numbers with the sampling methods enabled. Note that all release versions of Triton Inference Server with the TensorRT-LLM backend appear to have this same issue, so caution is required.
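The idea behind the patch is sketched below; the class and method names are illustrative stand-ins for the Triton postprocessing model, not the actual file.

from transformers import AutoTokenizer

class PostProcessor:
    def __init__(self, tokenizer_dir: str):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
        # Patch: tokenizer.vocab rebuilds the whole vocabulary dict on every access,
        # so compute the size once at initialization instead of per request.
        self.vocab_size = len(self.tokenizer.vocab)

    def postprocess(self, token_ids):
        # Before the patch, this path called len(self.tokenizer.vocab) for every batch.
        valid_ids = [t for t in token_ids if 0 <= t < self.vocab_size]
        return self.tokenizer.decode(valid_ids, skip_special_tokens=True)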
Final Thoughts
In this post, we explored various sampling techniques and their impact on the performance of both vLLM and TensorRT-LLM. Our experiments showed that sampling overhead becomes more pronounced in compute-bound scenarios, such as high request rates, decode-heavy datasets, or large batch sizes. In such cases, careful consideration of sampling overhead is required.
We observed that the overhead was more significant in vLLM compared to TensorRT-LLM, likely due to differences in implementation. In vLLM, methods that require sorting, such as Top-P and Top-K sampling, showed a greater overhead than repetition penalty.
In upcoming posts, we will continue to explore advanced features for real-world LLM deployments, such as structured output, prefix caching, and multi-LoRA support. We hope these insights will help practitioners make informed choices in their LLM deployment strategies.
Stay tuned for more insights in the vLLM vs TensorRT-LLM series!