[vLLM vs TensorRT-LLM] #12. Automatic Prefix Caching
This article provides a comparative analysis of automatic prefix caching.
Introduction
Modern applications leveraging large language models (LLMs) often rely heavily on prompt engineering to achieve optimal results. These prompts are frequently shared across multiple serving instances, effectively functioning as a common prefix for all requests. However, under a naïve implementation, this shared prefix must repeatedly undergo the prefill stage to generate the associated key-value (KV) cache, leading to redundant computation. Automatic Prefix Caching addresses this inefficiency by caching the results of these computations. In this article, we will dive into the concept and implementation of Automatic Prefix Caching and evaluate its practicality in various scenarios.
What is Automatic Prefix Caching?
Automatic Prefix Caching is an inference optimization technique designed to minimize computational overhead during LLM text generation, particularly in scenarios where shared prefixes are present. Figure 1 illustrates an example of a shared prefix and the principle of prefix caching. Shared prefixes are common in various serving workflows: for instance, system prompts reused across multiple requests or history tokens in multi-turn conversations. By caching the KV pairs associated with these shared prefixes, subsequent requests can reuse them, reducing both computational cost (by avoiding redundant processing) and memory usage (by sharing KV caches across requests).
The concept is simple: after the attention mechanism computes KV pairs for a given prefix, these pairs are stored in a cache. When the same shared prefix is encountered again, the cached KV pairs are retrieved and reused, bypassing redundant computations in subsequent prefill or decode phases.
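To make this concrete, here is a minimal, framework-agnostic sketch of the idea. The compute_kv function and the request-level dictionary keyed by prefix tokens are purely illustrative stand-ins, not the API of vLLM or TensorRT-LLM:

```python
from typing import Dict, List, Tuple

KVState = List[Tuple[int, int]]  # stand-in for per-token key/value tensors

def compute_kv(tokens: List[int]) -> KVState:
    """Stand-in for the prefill attention pass: produces one KV entry per token."""
    return [(t, t) for t in tokens]  # dummy "keys" and "values"

# Maps a shared prefix (as a token tuple) to its cached KV state.
prefix_cache: Dict[Tuple[int, ...], KVState] = {}

def prefill(prompt: List[int], prefix_len: int) -> KVState:
    prefix = tuple(prompt[:prefix_len])
    cached = prefix_cache.get(prefix)
    if cached is None:
        # Cache miss: compute KV for the shared prefix once and store it.
        cached = compute_kv(prompt[:prefix_len])
        prefix_cache[prefix] = cached
    # On a cache hit, only the non-shared suffix still needs a prefill pass.
    return cached + compute_kv(prompt[prefix_len:])
```

Real engines keep the KV tensors on the GPU and share them by reference rather than copying, but the control flow is essentially the same.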
Implementation Approaches
The implementation of prefix caching can take several forms.
- Dedicated Memory Cache:
The simplest approach involves allocating specific memory to store KV pairs for predefined prefixes. While this method introduces minimal overhead, it lacks flexibility, as only explicitly defined prefixes are cached. This approach is most effective in scenarios where system prompts are static and predictable.
- Dynamic Prefix Detection:
For cases like multi-turn conversations, where the chat history differs from request to request, the prefix must be identified dynamically. This problem is often solved using data structures like prefix trees (tries). A trie can track shared prefixes and their corresponding KV caches, making it a practical choice for implementing prefix caching in LLMs. A minimal trie sketch is shown after this list.
- Prefix Detection with Hash Tables:
Alternatively, hash tables can replace tries; this is especially efficient in systems built on PagedAttention, where KV caches are managed at the page level. Each KV cache page can be indexed by a hash of the prefix tokens combined with the tokens in that page. A hash table then maps logical blocks to actual KV cache pages, enabling efficient lookup and retrieval of shared prefixes.
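As a rough illustration of the trie-based approach mentioned above (a sketch under simplified assumptions, not the data structure any particular framework ships), each node corresponds to one token, and a node holds a handle to the cached KV of the prefix ending at that node:

```python
from typing import Dict, List, Optional

class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.kv_handle: Optional[object] = None  # cached KV for the prefix ending here

class PrefixTrie:
    """Toy trie for dynamic prefix detection; real systems also need eviction and ref-counting."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def longest_cached_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for i, tok in enumerate(tokens):
            node = node.children.get(tok)
            if node is None:
                break
            if node.kv_handle is not None:
                matched = i + 1
        return matched

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        """Record cached KV for the full token sequence."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.kv_handle = kv_handle
```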
Automatic Prefix Caching in Practice
In vLLM, Automatic Prefix Caching is implemented with hash tables due to their scalability and compatibility with page-granular KV cache management. The implementation computes a hash for each KV cache page and keeps it in a lookup table, augmented with custom eviction rules based on an LRU (Least Recently Used) policy. Here’s how it works (a simplified sketch follows the list):
- Each prompt is split into pages, and each page is hashed and checked against the table for an existing cache entry.
- If a page is not cached, its KV pairs are computed and stored for future use.
- When the cache exceeds its capacity, the least recently used entries are evicted, freeing space for new ones.
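The sketch below illustrates this hash-plus-LRU scheme under assumed details (16-token pages, a chained Python hash, no reference counting); it is an illustration of the mechanism described above, not vLLM's actual code:

```python
from collections import OrderedDict
from typing import List, Optional, Tuple

class PagePrefixCache:
    """Illustrative page-granular prefix cache with LRU eviction."""

    def __init__(self, capacity_pages: int, page_size: int = 16) -> None:
        self.capacity = capacity_pages
        self.page_size = page_size
        self.table: "OrderedDict[int, object]" = OrderedDict()  # page hash -> KV page

    def _page_hash(self, prev_hash: int, page_tokens: Tuple[int, ...]) -> int:
        # Chained hash: depends on all preceding tokens, not just this page.
        return hash((prev_hash, page_tokens))

    def lookup(self, tokens: List[int]) -> List[Optional[object]]:
        """Return the cached KV page for each page of tokens, or None where prefill is needed."""
        pages, prev_hash = [], 0
        for start in range(0, len(tokens), self.page_size):
            h = self._page_hash(prev_hash, tuple(tokens[start:start + self.page_size]))
            page = self.table.get(h)
            if page is not None:
                self.table.move_to_end(h)  # mark as most recently used
            pages.append(page)
            prev_hash = h
        return pages

    def store(self, tokens: List[int], kv_pages: List[object]) -> None:
        """Insert newly computed KV pages, evicting least recently used entries if full."""
        prev_hash = 0
        for start, page in zip(range(0, len(tokens), self.page_size), kv_pages):
            h = self._page_hash(prev_hash, tuple(tokens[start:start + self.page_size]))
            self.table[h] = page
            self.table.move_to_end(h)
            prev_hash = h
        while len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used page
```

Production implementations additionally track reference counts so that pages still in use by running requests are never evicted.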
Although the implementation details of TensorRT-LLM are proprietary, it likely follows a similar concept, leveraging hash tables and LRU-based eviction policies to manage KV caches effectively.
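For reference, enabling the feature is a one-line change in vLLM's offline API. The snippet below uses vLLM's enable_prefix_caching engine argument (the OpenAI-compatible server exposes the equivalent --enable-prefix-caching flag); the model name and prompts are just examples:

```python
from vllm import LLM, SamplingParams

# Automatic Prefix Caching is enabled via a single engine argument.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,
)

shared_prefix = "You are a helpful assistant. " * 50  # long shared system prompt
prompts = [
    shared_prefix + "Summarize the idea of prefix caching.",
    shared_prefix + "Explain how PagedAttention manages the KV cache.",
]

# The second request can reuse the KV cache pages computed for the shared prefix.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
```

TensorRT-LLM exposes an analogous option through its KV cache configuration (KV cache block reuse), though the exact flag name depends on the version and serving frontend.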
Balancing Trade-Offs
While Automatic Prefix Caching significantly reduces latency and improves time-to-first-token (TTFT) by lowering the number of prefill tokens processed, it is not without its challenges. The added overhead of maintaining hash tables becomes noticeable in scenarios with minimal or no shared prefixes, where the caching mechanism provides little benefit.
In the following sections, we will explore experimental results comparing the performance of Automatic Prefix Caching in vLLM and TensorRT-LLM, highlighting both its advantages and potential drawbacks to help you determine whether this feature is right for your use case.
Experiment Setup
Benchmark Dataset
- Dynamic: Dynamic-Sonnet 1K, 2K, 4K
- Fixed: random tokens with fixed input and output lengths corresponding to those of the dynamic dataset 1K, 2K, 4K
For a fair comparison, the output length was fixed at 1K tokens across all datasets. Note that each Dynamic-Sonnet dataset shares a common prefix whose length is a quarter of the input length. The number of requests was set to 256, and the max_batch_size was set to 128 in all experiments.
Model and Hardware Specification
- Model: Llama-3.1-8B-Instruct (BF16)
- Hardware: Intel Xeon(R) Platinum 8273CL 2.20GHz, 1 NVIDIA A100-SXM 80G GPU, 170GB RAM
Framework Version
- vLLM: v0.6.3
- TensorRT-LLM: v0.15.0 / Triton Server: v2.52.0
Results
Overhead of Automatic Prefix Caching
To measure the computational overhead introduced by Automatic Prefix Caching, we compared performance metrics with and without prefix caching using a random dataset without shared prefixes. As depicted in Figure 2, the overhead was relatively minor in TensorRT-LLM, but significant in vLLM, with a throughput reduction of ~36.7% and a TPOT increase of ~25.0%.
In vLLM, the overhead became more pronounced with longer input sequences or higher concurrency. This can be attributed to the increasing time cost of hash function computations and hash table comparisons, which scale with input length and concurrency. When the cache hit ratio remains extremely low—as in datasets without shared prefixes—these increased time costs are directly reflected as performance overhead.
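A back-of-the-envelope count makes the scale of this wasted work clear. Assuming 16-token pages (vLLM's default block size) purely for illustration:

```python
page_size = 16        # tokens per KV cache page (assumed)
input_len = 4096      # tokens per request
concurrency = 128     # concurrently scheduled requests

hashes_per_request = input_len // page_size            # 256 page hashes per prompt
fruitless_lookups = hashes_per_request * concurrency   # 32,768 lookups that never hit
print(hashes_per_request, fruitless_lookups)
```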
The impact of this overhead is also evident in Figure 3, where TTFT deteriorated with APC enabled. The implementation of Automatic Prefix Caching in vLLM (as of v0.6.3) relies heavily on Python's built-in hash function. Although the hash itself is fast, invoking it for every page of every prompt still introduces non-trivial computational overhead.
By contrast, although the implementation details of TensorRT-LLM are proprietary, the experimental results suggest that it achieves greater efficiency by optimizing KV cache management. This implies the presence of highly optimized hash functions, hash management strategies, and computational kernels in TensorRT-LLM. Such optimizations likely mitigate the performance impact of APC, particularly under conditions of minimal cache hits.
Impact of Automatic Prefix Caching on Dataset with Shared Prefix
To analyze the effectiveness of Automatic Prefix Caching with datasets sharing a common system prompt as a prefix, we conducted experiments under the same conditions as in the previous benchmarks. As shown in Figure 4, Automatic Prefix Caching significantly improved performance for both TensorRT-LLM and vLLM, irrespective of input length or concurrency levels.
For TensorRT-LLM, throughput improved by ~34.7%, and TPOT saw a ~20.9% gain, while vLLM achieved more modest improvements of ~13.3% in throughput and ~9.8% in TPOT. These performance gains can be attributed to the efficient reuse of cached KV pairs, which minimizes redundant KV operations for shared prefixes. This optimization enhances throughput and TPOT while also improving TTFT (Figure 5), as cached KV values are directly retrieved, bypassing the need for repetitive calculations.
However, while TensorRT-LLM consistently demonstrated throughput gains, even with larger datasets (e.g., 4K tokens), vLLM exhibited unexpected performance degradation when Automatic Prefix Caching was enabled under high concurrency. In these scenarios, throughput and TPOT metrics underperformed compared to baseline benchmarks without caching.
This decline stems from preemption issues caused by the limited availability of KV cache pages under conditions of increased input sequence lengths and larger batch sizes. Although Automatic Prefix Caching in vLLM v0.6.3 operates as intended, a significant limitation exists in the request scheduler. The current scheduler does not account for cache hits when calculating prefill tokens during scheduling, which constrains the number of batched prefill requests and reduces KV cache utilization. Consequently, this results in inefficient scheduling and degraded performance, especially for longer sequences and larger batch sizes.
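The effect of this scheduler limitation can be illustrated with hypothetical numbers (the token budget and cache-hit length below are made up; this is not vLLM's scheduler code):

```python
token_budget = 8192   # hypothetical cap on batched prefill tokens per scheduling step
prompt_len = 4096     # tokens per request
cached_len = 3072     # tokens already covered by prefix cache hits

# Counting all prompt tokens against the budget admits far fewer requests per step
# than counting only the tokens that actually need prefill.
naive_batch = token_budget // prompt_len                       # 2 requests per step
cache_aware_batch = token_budget // (prompt_len - cached_len)  # 8 requests per step
print(naive_batch, cache_aware_batch)
```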
The recently released vLLM v0.6.5 addresses this issue, introducing an updated scheduling mechanism expected to mitigate these limitations and enhance performance under such conditions.
Impact of the Shared Prefix Ratio
To evaluate how varying the ratio of shared prefixes within inputs of the same length affects throughput, we conducted benchmarks using a custom dataset with controlled shared prefix ratios. To isolate the performance impact of Automatic Prefix Caching from memory optimization effects, we used a fixed input length of 1K tokens and an output length of 1K tokens, with a concurrency of 128. This setup ensured that no preemption occurred, even without caching.
As illustrated in Figure 6, increasing the shared prefix ratio significantly reduces the number of prefill tokens requiring computation, thereby improving throughput. For vLLM, raising the shared prefix ratio from 0.1 to 0.9 yielded a 32% improvement in throughput. In comparison, TensorRT-LLM achieved a more pronounced 49% throughput gain under the same conditions.
Interestingly, while vLLM demonstrated a steady increase in throughput as the shared prefix ratio grew, TensorRT-LLM exhibited a performance dip around a shared prefix ratio of 0.5. This anomaly may stem from the granularity of prefix caching and the corresponding prefill phase implementation in TensorRT-LLM. However, due to the proprietary nature of its codebase, we lack definitive insights into the underlying cause for now.
Final Thoughts
Automatic prefix caching is a powerful optimization technique that can enhance the efficiency of LLM inference, particularly in scenarios where common prefixes are shared, such as system prompts or multi-turn conversations. By reducing redundant computation on these prefixes, it can significantly improve the performance of LLM applications where multiple requests share substantial portions of their input.
However, the effectiveness of automatic prefix caching is contingent on the specific framework and scenario. In the case of vLLM, while prefix caching can introduce overhead in situations without shared prefixes due to cache management costs, it can still prove beneficial in many cases where prefixes are shared, delivering noticeable performance improvements. Conversely, TensorRT-LLM exhibits negligible overhead even when prefixes are not shared. Furthermore, in scenarios with shared prefixes, TensorRT-LLM achieves higher performance gains from automatic prefix caching compared to vLLM.
For those seeking to optimize LLM performance, evaluating the applicability of automatic prefix caching based on the usage scenario is crucial. Understanding the nature of the workload, specifically whether prefixes are shared or not, can help determine whether automatic prefix caching will provide tangible benefits.
To explore the potential impact of automatic prefix caching on your specific use case, we encourage you to try our FitsOnChips service, a freely available tool designed to help you refine your serving setup efficiently. It lets practitioners experiment with automatic prefix caching configurations and compare performance with and without prefix caching.