[vLLM vs TensorRT-LLM] #10 Serving Multiple LoRAs at Once
This article provides a comparative analysis of multi-LoRA serving capabilities of vLLM and TensorRT-LLM frameworks.
Dec 05, 2024
Introduction
As Large Language Models (LLMs) gain traction in real-world applications, task-specific fine-tuned models are becoming increasingly popular. To achieve the best performance with a given foundation model size, fine-tuning pre-trained LLMs for specific tasks is a widely used approach. However, training all the parameters in LLMs is computationally intensive. To address this, the Low-Rank Adaptation (LoRA) technique has emerged as a favored solution.
LoRA not only accelerates fine-tuning by training a smaller set of parameters but also significantly reduces the parameter size needed for task adaptation. This approach enables the use of a single foundation model alongside multiple LoRA modules, making it possible to cater to a variety of task-specific applications in a highly resource-efficient manner.
In this post, we will explore how vLLM and TensorRT-LLM support Multi-LoRA serving scenarios and compare their capabilities in this context.
Serving LLMs with LoRA Modules
In previous blog posts, we have explored scenarios where LLMs are served without LoRA modules. In fact, even when a LoRA module is used, there is no difference from a serving perspective if only a single LoRA module is used. This is because a LoRA module can be fused directly into the corresponding layers of the LLM. In other words, after training is completed, the LoRA weights can be merged with the LLM weights, allowing the fine-tuned LLM to be served without any additional overhead.
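As a minimal sketch of this fusing step (shapes and scaling follow the standard LoRA formulation; the dimensions and names below are illustrative, not taken from either framework):

```python
# Minimal sketch of merging a trained LoRA module into a base linear layer.
# Shapes follow the usual LoRA convention: W is (d_out, d_in),
# A is (r, d_in), B is (d_out, r); alpha / r is the LoRA scaling factor.
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Return the fused weight W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)

# Example: a 4096 x 4096 projection with a rank-16 adapter.
d_out, d_in, r, alpha = 4096, 4096, 16, 32
W = torch.randn(d_out, d_in)
A = torch.randn(r, d_in)
B = torch.zeros(d_out, r)   # B is typically initialized to zero before training
W_merged = merge_lora(W, A, B, alpha, r)
```

After this merge, the fused weight can replace the original projection, so serving a single fine-tuned model incurs no extra computation at inference time.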
However, the situation becomes significantly more complex when multiple LoRA modules are used with a single foundation model. The simplest approach is shown in the left diagram of Figure 1: a dedicated LLM variant is created for each module, such as LLM-A (the base LLM fused with LoRA module a), LLM-B (fused with module b), and LLM-C (fused with module c). Each of these variants is treated as a separate model during serving. While straightforward, this method is highly memory-inefficient, requiring a memory footprint several times larger to hold all the model variants (3x in the figure).
To address this inefficiency, the Multi-LoRA serving approach was introduced, as illustrated in the right diagram of Figure 1. Instead of creating separate LLM variants for each LoRA module, this method retains a shared base LLM and performs additional computations for LoRA modules separately. While this sacrifices the overhead-free advantage of fusing LoRA modules directly into the base LLM, it offers significantly better memory efficiency in scenarios requiring multiple LoRA modules.
While Multi-LoRA serving is more efficient than serving multiple fine-tuned models independently, there still exist several optimization challenges. For example, deciding where to store LoRA modules, determining how many different modules can be processed simultaneously within a single batch, or batching requests that require different LoRA modules may affect Multi-LoRA serving performance. Additionally, the characteristics of the LoRA modules themselves—such as the rank or the number of layers—can affect the overhead required for their computations.
Figure 2 illustrates a typical Multi-LoRA serving scenario. In this example, the server is configured to support three different LoRA modules (a, b, and c). Since not all of these modules are needed for every request, they are initially stored in host memory. When a user sends a request with a specific LoRA module ID, the server loads the necessary modules into GPU memory, performs computations for the base LLM and LoRA modules in parallel, and combines the results before passing them to the next layer for inference.
Once a LoRA module is loaded into GPU memory, it can be cached for reuse. This eliminates host-to-GPU communication for subsequent requests requiring the same module. However, due to the limited GPU memory available for storing LoRA modules, evictions may occur, leading to frequent host-GPU communication when swapping modules.
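Conceptually, this caching behavior resembles a small LRU cache keyed by LoRA ID. The toy sketch below is only an illustration of the idea, not how either framework actually implements its cache:

```python
# Toy LRU-style GPU cache for LoRA modules (illustration only).
from collections import OrderedDict

class LoraGpuCache:
    def __init__(self, capacity: int):
        self.capacity = capacity          # max adapters resident on the GPU
        self.cache = OrderedDict()        # lora_id -> weights already on the GPU

    def get(self, lora_id, load_from_host):
        if lora_id in self.cache:
            self.cache.move_to_end(lora_id)      # cache hit: mark as recently used
            return self.cache[lora_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)       # evict the least recently used adapter
        weights = load_from_host(lora_id)        # host-to-GPU copy on a miss
        self.cache[lora_id] = weights
        return weights
```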
To further optimize performance, servers often group requests using the same LoRA module into batches. By processing these grouped requests together, the system achieves better resource utilization and throughput.
As mentioned earlier, one of the optimization points in Multi-LoRA serving is how much GPU memory to allocate for LoRA modules. Allocating more memory for LoRA modules reduces the space available for the KV cache, while allocating too little memory limits the number of LoRA modules that can be processed simultaneously, potentially lowering throughput. Therefore, finding the right balance is essential to achieving optimal performance.
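For a sense of scale, the sketch below estimates the per-token KV cache cost of a Llama-3.1-8B-class model (the dimensions are assumptions for illustration) and how many cacheable tokens a given LoRA cache budget displaces:

```python
# Rough KV-cache sizing for a Llama-3.1-8B-class model (illustrative numbers:
# 32 layers, 8 KV heads, head dim 128, fp16). Memory handed to the LoRA cache
# is memory taken away from cacheable tokens.
n_layers, n_kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token / 1024, "KiB per token")             # 128 KiB

lora_cache_gib = 4                                            # hypothetical LoRA budget
tokens_lost = lora_cache_gib * 1024**3 // kv_bytes_per_token
print(tokens_lost, "fewer cacheable tokens")                  # 32768 tokens
```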
vLLM vs TensorRT-LLM: Approaches to Multi-LoRA Serving
Efficient Multi-LoRA serving requires frameworks to carefully manage memory allocation for LoRA modules while providing users the flexibility to control memory usage through various options. Both vLLM and TensorRT-LLM address these needs using distinct approaches.
vLLM: Adaptability for Dynamic Workloads
vLLM’s Multi-LoRA implementation emphasizes memory bandwidth optimization, particularly for scenarios where a single baseline model is shared across multiple LoRA modules with inputs of varying sequence lengths. Its asynchronous execution pipeline is designed to maximize throughput under dynamic workloads while maintaining memory efficiency. Additionally, its hash-based KV cache system integrates LoRA IDs into the hashing process, improving cache hit rates and reducing the memory bandwidth required to load LoRA modules.
A key component enabling this functionality is the Punica kernel, designed specifically for dynamically loading and managing LoRA modules during inference. It optimizes performance in multi-LoRA serving scenarios by ensuring that only the necessary modules are loaded into GPU memory. The kernel supports lazy loading and eviction, optimizing memory usage by transferring modules between GPU and CPU memory as needed. Metadata tracking, which includes module states and memory locations, further facilitates dynamic selection and switching of modules to streamline inference pipelines. In vLLM, the Punica kernel is specifically designed to manage the loading of LoRA modules and efficiently execute their associated mathematical operations. Its modular architecture ensures that it integrates seamlessly into vLLM's execution pipeline without interfering with other essential LLM features, such as scheduling and post-processing. This design enhances both the flexibility and overall efficiency of the system.
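The computation these kernels accelerate can be written down compactly. Below is a naive PyTorch sketch of the same math; the real Punica kernels fuse this gather-and-multiply into custom CUDA, so the loop is only for illustration:

```python
# The math implemented by grouped multi-LoRA kernels, written as a
# per-request loop for clarity.
import torch

def multi_lora_linear(x, W, A_stack, B_stack, lora_ids, scaling):
    """
    x:        (batch, d_in)      input activations
    W:        (d_out, d_in)      shared base weight
    A_stack:  (n_loras, r, d_in) stacked LoRA A matrices
    B_stack:  (n_loras, d_out, r) stacked LoRA B matrices
    lora_ids: (batch,)           which adapter each request uses
    """
    y = x @ W.T                                  # shared base computation for the whole batch
    for i, lid in enumerate(lora_ids.tolist()):  # per-request LoRA delta (kernels fuse this)
        y[i] += scaling * (x[i] @ A_stack[lid].T) @ B_stack[lid].T
    return y
```

The key point is that the expensive base projection is shared across the batch, while only the small low-rank deltas differ per request.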
LoRA-related parameters in vLLM are often defined as maximum values to accommodate the framework's support for both static loading at server initialization and dynamic updates via APIs. Dynamic updates allow for the on-the-fly addition or removal of modules, enabling efficient management of diverse LoRA modules. While this approach may not achieve optimal memory usage down to the last bit, it offers significant flexibility, making it well-suited for environments with complex task requirements.
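As an illustration of the dynamic path, the sketch below registers and removes an adapter on a running OpenAI-compatible vLLM server. The endpoint names follow recent vLLM versions and require runtime LoRA updating to be enabled via the `VLLM_ALLOW_RUNTIME_LORA_UPDATING` environment variable; the adapter name and path are placeholders.

```python
# Sketch: dynamically loading/unloading a LoRA adapter on a running vLLM server.
import requests

BASE_URL = "http://localhost:8000"

# Register a new adapter on the fly (name and path are placeholders).
resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "sql-adapter", "lora_path": "/models/loras/sql-adapter"},
)
resp.raise_for_status()

# Remove it again when it is no longer needed.
requests.post(f"{BASE_URL}/v1/unload_lora_adapter", json={"lora_name": "sql-adapter"})
```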
TensorRT-LLM: Performance-Tuned Consistency
TensorRT-LLM focuses on delivering consistent, high-throughput performance optimized for NVIDIA GPUs. It uses a memory preallocation strategy for LoRA module caching, controlled via `lora_cache_host_memory_bytes`. While this ensures predictable resource allocation and low latency, it introduces several limitations requiring careful configuration. First, the maximum number of LoRA modules depends on the size of the preallocated GPU or host memory. Unlike vLLM, TensorRT-LLM also requires additional setup during the model engine build phase: developers must specify both the maximum rank of the LoRA modules and the target linear layers where LoRA modules will be applied, which adds rigidity to the setup process.
Another important consideration is the cache eviction mechanism. Instead of setting a hard limit on the number of LoRA modules, TensorRT-LLM limits the maximum cache size and explicitly manages both host and device caches. When the cache size is exceeded, previously cached LoRA weights are evicted. This approach allows for more flexible memory utilization, allowing the LoRA cache to be dynamically divided among many smaller LoRA modules or a few larger ones as needed. However, this explicit management requires users to define both the host and device cache sizes, introducing an additional optimization parameter. While this provides greater control, it also increases the complexity of optimizing LoRA serving performance, adding another layer of effort for users to balance memory and efficiency effectively.
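To get a feel for these sizes, the back-of-the-envelope sketch below estimates how many rank-16 adapters fit in a given cache budget. The layer dimensions are illustrative values for a Llama-3-class model, not numbers taken from either framework's configuration:

```python
# Rough estimate of LoRA adapter size vs. a cache budget.
# Each adapted linear layer adds r * (d_in + d_out) parameters.
def lora_bytes(rank: int, layer_shapes, n_layers: int, bytes_per_param: int = 2) -> int:
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return per_layer * n_layers * bytes_per_param

# Q and V projections only (hidden size 4096; 8 KV heads of dim 128 -> V output 1024).
qv_shapes = [(4096, 4096), (4096, 1024)]
adapter_size = lora_bytes(rank=16, layer_shapes=qv_shapes, n_layers=32)

cache_budget = 2 * 1024**3                        # e.g. a 2 GiB host cache
print(round(adapter_size / 1024**2, 1), "MiB per adapter")   # ~13 MiB
print(cache_budget // adapter_size, "adapters fit in the cache")
```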
TensorRT-LLM’s approach to LoRA caching requires the client to send both the LoRA weights and IDs in the initial request. When LoRA weights are evicted even from the host cache, the client must resend the weights alongside its requests. This feature enhances LoRA module flexibility but raises potential risks, such as exposing host memory to client-side operations. Another issue with this feature is the transmission protocol: LoRA weights are transmitted as NumPy files, which the server converts into TensorRT format for efficient caching. However, limitations currently arise when working with `bfloat16` models. Since NumPy does not natively support the `bfloat16` format, compatibility issues may occur during weight conversion. To address this, weights must first be converted to binary or `float16` format before transmission. These challenges are expected to be resolved in future updates. Despite them, TensorRT-LLM’s memory caching and GPU optimizations ensure predictable performance, making it well suited to stable workloads.
Performance Evaluation
Experimental Setup
To evaluate the performance of vLLM and TensorRT-LLM for Multi-LoRA serving, we conducted experiments under the following setup:
- Model: Llama-3.1-8B-Instruct
- GPU: NVIDIA A100 (PCIe) 80GB
- CPU: Intel(R) Xeon(R) Gold 6338 @ 2.00GHz
- Framework Versions
- vLLM: 0.6.3
- TensorRT-LLM: v0.14.0 / Triton Server: 2.51.0
- Benchmark Dataset: Fixed-length random-token dataset (1K input & 1K output)
When serving LLMs with LoRA modules, there are two different scenarios in terms of who provides the LoRA modules. In the first, the supported LoRA modules are pre-defined by the server and pre-loaded during initialization; the server then receives only the index of the desired LoRA module with each request. In the second, users send the LoRA module itself (including its weights) to the server along with the request. In this post, we focus on benchmarking the scenario where all required LoRA modules are pre-loaded. For this purpose, in vLLM, the `VLLM_ALLOW_RUNTIME_LORA_UPDATING` flag was kept at its default value of `false`.
In the case of TensorRT-LLM, there are a few things to note. While A100 GPUs support both `bfloat16` and `float16`, we opted for `float16` with TensorRT-LLM in the Multi-LoRA experiments. This choice was made because `float16` delivers nearly identical latency and offers better compatibility with gRPC communication, as NumPy does not support `bfloat16`. When building a TensorRT-LLM engine compatible with LoRA modules, it is necessary to specify the types of linear layers to which the LoRA modules will be applied. For vLLM, we apply LoRA modules to the specific attention layers defined in each LoRA configuration. For TensorRT-LLM, however, all attention projections (Q, K, and V) are tied and enabled together if a LoRA module is applied to any one of them. This is required by the way TensorRT-LLM handles the attention mechanism: if LoRA modules are applied to any attention projection, such as Q or V, the engine must be configured to support LoRA modules across all attention projections, including Q, K, and V.
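Both of these per-adapter properties, the rank and the target layers, are recorded in each adapter's PEFT `adapter_config.json`, which can be inspected as in the sketch below (the path is a placeholder). Roughly speaking, this is the information vLLM picks up per adapter at load time and that TensorRT-LLM needs declared up front at engine-build time.

```python
# Inspect a LoRA adapter's configuration as produced by the PEFT library.
import json

with open("/models/loras/sql-adapter/adapter_config.json") as f:
    cfg = json.load(f)

print("rank:", cfg["r"])                         # e.g. 8, 16, 64
print("alpha:", cfg["lora_alpha"])               # scaling factor = lora_alpha / r
print("target layers:", cfg["target_modules"])   # e.g. ["q_proj", "v_proj"]
```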
LoRA Ranks & Target Layers
We first analyzed how the characteristics of LoRA modules impact Multi-LoRA serving throughput. Specifically, we experimented with two key variables: the LoRA rank and the target layers where the LoRA modules are applied.
The LoRA rank, as shown in Figure 2, determines the shape of the LoRA layers. A higher rank increases the number of parameters, which enhances the power of the LoRA module. However, this improvement comes with a trade-off—higher computational and memory overhead.
Meanwhile, LoRA modules can be applied to various linear layers of the base LLM. The two most common configurations are as follows:
- Applying LoRA modules only to the Query projection and Value projection layers within the attention mechanism.
- Extending LoRA modules to the Key projection and MLP layers, effectively applying LoRA to all linear layers of the model.
Other configurations, such as applying LoRA modules to the lm_head, also exist, but this analysis focuses on the differences between the two main configurations mentioned above.
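For reference, these two configurations correspond to different `target_modules` choices when the adapter is trained, for example with the PEFT library. The module names below follow the usual Llama naming convention and are given purely for illustration:

```python
from peft import LoraConfig

# Configuration 1: LoRA on the Query and Value projections only.
qv_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# Configuration 2: LoRA on all linear layers of the model
# (attention projections plus the MLP projections).
all_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```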
Figure 3 illustrates how throughput changes for vLLM and TensorRT-LLM as the LoRA rank and target layers vary. In this experiment, the maximum batch size was set to 256, with a total of 1,024 requests distributed across 16 different LoRA modules. Requests were designed such that LoRA IDs 0 through 15 were used in a round-robin manner to ensure fair usage of all LoRA modules. Additionally, the request rate was set to infinite, meaning all 1,024 requests were queued from the start. Therefore, requests using the same LoRA module could be batched together for inference, maximizing efficiency.
Overall, TensorRT-LLM demonstrated significantly higher throughput compared to vLLM across all configurations. Remarkably, there were cases where TensorRT-LLM was even faster while performing Multi-LoRA serving than vLLM operating without any LoRA modules.
A closer examination reveals that as the LoRA rank increased from 8 to 64, vLLM experienced a noticeable decline in throughput, whereas TensorRT-LLM maintained stable performance with little variation. When applying LoRA to all linear layers, vLLM showed a throughput degradation of 23.9%–47.0%, depending on the rank of the LoRA modules, while TensorRT-LLM exhibited a slightly higher degradation of 40.0%–47.7%.
Additionally, while vLLM showed minimal differences in performance between configurations that applied LoRA to Query and Value (QV) layers versus all linear layers (All), TensorRT-LLM exhibited a significant throughput gap between the two. Notably, TensorRT-LLM achieved performance closer to the baseline with the QV configuration, which involves fewer LoRA-adapted layers, showing only 14.1%–18.6% throughput degradation. In contrast, vLLM exhibited 23.8%–46.1% degradation with the QV configuration, which is nearly identical to the degradation observed when LoRA was applied to all linear layers.
From the results above, we observed that vLLM is significantly influenced by the LoRA rank, whereas TensorRT-LLM is more sensitive to the target layers. To further analyze the impact of these variables, we conducted additional experiments, focusing on the trade-off between throughput and TPOT (Time Per Output Token). Instead of fixing the batch size at 256, this time we varied the concurrency from 4 to 256 and measured the throughput and TPOT.
As shown in Figure 4, lower ranks in vLLM shifted the graph toward the top-left, achieving higher throughput and lower TPOT. For TensorRT-LLM, the QV configuration consistently outperformed the All configuration, delivering better performance. While these results may seem intuitive, it is interesting to note how the influential variables differ between the two frameworks. Results also highlight the importance of selecting the optimal LoRA configuration to balance overhead and qualitative performance effectively.
Number of LoRA Modules
Next, we analyzed the impact of the number of supported LoRA modules in a Multi-LoRA serving environment. In previous experiments, we tested with 16 different LoRA modules. For this analysis, we varied the number of LoRA modules from 2 to 64 and measured the resulting throughput.
In vLLM, this configuration is controlled by two variables: `max_cpu_loras` and `max_loras`. The `max_cpu_loras` variable determines the total number of LoRA modules the server can support, which affects host memory usage. On the other hand, `max_loras` sets the maximum number of LoRA modules that can be used simultaneously in a single batch, influencing the amount of GPU memory allocated for LoRA computations. In this experiment, we set these two values to be equal and varied them between 2 and 64 to measure their impact.
Additionally, we experimented with different input sequence (prompt) lengths, extending beyond the default 1K to 2K, 4K, and 8K. This extension was motivated by the following reasoning: increasing the number of supported LoRA modules inherently allocates more GPU memory to LoRA modules, thereby reducing the space available for the KV cache. Since the KV cache becomes more dominant for longer sequences, we expected its impact to be more pronounced in such scenarios.
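For reference, these two knobs map onto vLLM's offline API roughly as in the sketch below; the model and adapter paths are placeholders and the values are examples, not the full benchmark configuration:

```python
# Sketch: serving a pre-loaded set of LoRA adapters with vLLM's offline API.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_loras=16,       # adapters that may be active in a single batch (GPU memory)
    max_cpu_loras=16,   # adapters kept resident in host memory
    max_lora_rank=64,
)

sampling = SamplingParams(max_tokens=128)
# Route a request to one adapter (name, integer ID, and path are placeholders).
lora = LoRARequest("adapter-3", 3, "/models/loras/adapter-3")
outputs = llm.generate(["Translate this sentence ..."], sampling, lora_request=lora)
```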
Surprisingly, as shown in Figure 5, the throughput remained almost constant regardless of the number of supported LoRA modules for both vLLM and TensorRT-LLM. While there was about a 10% difference in throughput when comparing 2 LoRA modules to 64, the trend was not as dramatic as expected. Note that in the case of TensorRT-LLM, since it does not explicitly control the number of LoRA modules but instead manages the amount of GPU memory allocated for LoRA, minor fluctuations in performance were observed.
From this observation, we can conclude that the number of LoRA modules in a batch does not significantly impact overall throughput, provided there are enough requests per module to fully utilize the active batch size.
Limited LoRA Modules on GPU
In the previous experiment, we kept the total number of LoRA modules supported by the server equal to the number that could be loaded into the GPU. However, this setup did not consider two potential issues commonly encountered in multi-LoRA serving. First, it eliminated the possibility of LoRA evictions, ensuring that all LoRA modules remained cached on the GPU without needing to reload them from host memory. Second, because all incoming requests corresponded to the LoRA modules already loaded in GPU memory, they could be batched together without restrictions.
This time, we adjusted the experimental setup to evaluate the impact of limited GPU memory allocation for the LoRA cache. The total number of LoRA modules supported by the server was fixed at 64, while the size of the GPU LoRA cache was varied. For vLLM, the number of LoRA modules that could be simultaneously loaded into the GPU ranged from 2 to 64. For TensorRT-LLM, the memory fraction of the GPU allocated for the LoRA cache was varied between 0.02 and 0.36. Throughput was measured under two concurrency settings, 128 and 256, to determine whether the active batch size might be limited by the number of loaded LoRA modules. In this setup, all requests were assigned 64 LoRA IDs, regardless of the number of modules loaded into the GPU. As a result, a reduced number of loaded LoRA modules could constrain the active batch size, as only requests corresponding to loaded LoRA modules can be batched for computation.
As shown in Figure 6, throughput increased rapidly as the number of LoRA modules that could be loaded into the GPU simultaneously grew. This doesn’t imply that merely increasing the number of LoRA modules directly improves throughput but suggests that a higher number of loadable modules can enable a larger active batch size, leading to better throughput. To clarify this concept, let’s consider a scenario where only two LoRA modules can be loaded at a time. With a concurrency of 128, the 128 requests would consist of two sets of requests, each requiring LoRA modules 0 through 63 in order. In this case, only two requests for each LoRA module are queued. Since only two LoRA modules can be loaded at once, only four requests can be batched and processed in parallel at first. This limits the active batch size to just four, resulting in low throughput. While the active batch size may increase as more requests arrive after the first batch inference, the number of requests that can be processed simultaneously is heavily constrained by the number of LoRA modules that can be loaded, which explains the low throughput. As more LoRA modules are allowed to load, throughput increases significantly.
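The arithmetic behind this example can be captured with a small, simplified model that ignores scheduling details and assumes perfectly round-robin traffic:

```python
# Simplified model of how the GPU-loadable adapter count limits the batch
# that can run immediately when adapter IDs are assigned round-robin.
def initial_batch_size(concurrency: int, n_adapters: int, gpu_loadable: int,
                       max_batch: int = 256) -> int:
    requests_per_adapter = concurrency // n_adapters       # queued requests per LoRA ID
    return min(gpu_loadable * requests_per_adapter, concurrency, max_batch)

print(initial_batch_size(128, 64, 2))    # -> 4, matching the example above
print(initial_batch_size(128, 64, 32))   # -> 64
print(initial_batch_size(256, 64, 64))   # -> 256
```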
Meanwhile, when examining the vLLM results for concurrency 256, we observe an interesting phenomenon: throughput declined once the number of LoRA modules loaded simultaneously reached 64. At this point, all 256 queued requests can theoretically be processed at once. This results in vLLM initially executing with a very large batch size, but the extensive memory required to store LoRA modules leaves relatively little space for the KV cache. This leads to frequent preemption and a subsequent drop in throughput.
On the other hand, TensorRT-LLM, with its default scheduling policy of Guaranteed_no_evict, avoids preemption entirely. As a result, it does not exhibit the same drop in throughput seen with vLLM. Instead, TensorRT-LLM experiences only minor decreases in batch size due to the reduced KV cache space, allowing it to maintain stable throughput throughout the experiment.
Final Thoughts
As this series has explored, the performance of vLLM and TensorRT-LLM in Multi-LoRA serving scenarios depends on a variety of factors, including LoRA rank, target layers, the number of supported modules, and GPU memory allocation strategies. Each framework exhibits unique strengths and trade-offs, influenced by these variables.
While TensorRT-LLM consistently demonstrated superior performance in our experiments, it’s important to note that the usability of the framework presented significant challenges during testing. In contrast, vLLM proved to be much more intuitive and convenient to use, especially from a development and operational standpoint. This distinction cannot be overlooked, as usability often plays a critical role in real-world implementations. For use cases requiring frequent updates or ongoing development, vLLM might be the better choice from a business perspective, despite its lower throughput in certain configurations.
The findings in this post reinforce a key takeaway: there is no single golden rule for optimizing Multi-LoRA serving. The best configuration—whether it involves LoRA rank, target layers, or memory allocation—depends heavily on the specific requirements and constraints of the service scenario. For some use cases, maximizing throughput might take precedence, while others may prioritize minimizing latency or enhancing usability.
Ultimately, achieving the best performance requires carefully tuning both the LoRA configurations and the Multi-LoRA serving options. By understanding the nuanced behavior of frameworks like vLLM and TensorRT-LLM, developers can identify the optimal setup for their unique workloads, enabling efficient and scalable LLM-based solutions tailored to their needs.
Stay tuned for more insights in the vLLM vs TensorRT-LLM series!