[vLLM vs TensorRT-LLM] #13. Vision-Language Models
This article provides a comparative analysis of serving vision-language models on vLLM and TensorRT-LLM.
Jan 20, 2025
Contents
- Introduction
- Vision-Language Model
- What is a Vision Encoder?
- Serving Vision-Language Models
- Experiment Setup
- Impact of Image Input
- Overview
- Results
- Implications
- User Scenario-based Evaluation
- Scenario #1. Fixed Input and Output Text Length
- Scenario #2: Scaling Output Complexity with Image Count
- Conclusion

Introduction
Large Language Models (LLMs) are evolving rapidly, with multimodality emerging as a key focus of development. For example, Vision Language Models (VLMs) process and integrate text and image data, enabling more versatile AI systems. This progress has driven advancements in real-world applications such as AI assistants, visual search engines, content moderation, medical diagnostics, and autonomous systems.
Meta’s LLaMA series exemplifies this shift. While earlier versions were limited to text processing, LLaMA 3.2 introduced vision capabilities, marking a significant step toward comprehensive multimodal understanding. This aligns with the broader industry push to develop models that handle complex, cross-modal tasks.
As these models grow in size and complexity, serving them efficiently in production has become increasingly critical. Beyond their high computational demands, complex data pipelines and the need for low-latency inference add further challenges to deployment.
In this blog post, we compare the serving performance of two leading frameworks—vLLM and TensorRT-LLM—using LLaVA-1.5-7B-HF, one of the most popular models with vision and language capabilities. While our experiments focus on LLaVA, the insights extend to serving VLMs more broadly. This evaluation aims to provide practical guidance for choosing the right serving framework based on specific deployment needs.
Vision-Language Model
What is a Vision Encoder?
![Figure 1. Data processing pipeline of LLaVA model for single text and image inputs.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F688266c5-ceda-428c-932d-6066e6afca61%252Fvlm_structure.png%3Ftable%3Dblock%26id%3D181258ac-0943-80f8-88dd-d1e96e9874ef%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
Unlike traditional LLMs, which process and generate only textual data, VLMs integrate visual understanding through a vision encoder. This component allows VLMs to handle tasks that combine image and text inputs, such as caption generation, visual question answering, and image-grounded dialogue.
A vision encoder processes raw image data and converts it into dense, high-dimensional visual embeddings. These embeddings capture essential visual information—such as shapes, colors, textures, and higher-level concepts (e.g., identifying objects like "cat" or "car")—which can be combined with text inputs to enable reasoning across multiple data types.
Most modern vision encoders, including those used in LLaVA, rely on the Vision Transformer (ViT) architecture. Visual embedding outputs from these encoders are projected into the same embedding space as text tokens, forming image tokens. As shown in Figure 1, the LLaVA-1.5-7B-HF model converts image input into image tokens, which are then integrated with text tokens to generate coherent, contextually relevant outputs.
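To make this pipeline concrete, the sketch below runs a single text-and-image pair through LLaVA-1.5-7B-HF with the Hugging Face transformers API. It is a minimal illustration, assuming the llava-hf/llava-1.5-7b-hf checkpoint and a placeholder image URL; argument names may vary slightly across library versions.

```python
# Minimal sketch: a single image and prompt through LLaVA-1.5-7B-HF.
# Assumes the Hugging Face `transformers` library and the
# "llava-hf/llava-1.5-7b-hf" checkpoint; the image URL is a placeholder.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype="float16", device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# The processor resizes the image to 336x336 and keeps the <image> placeholder;
# the vision encoder later expands it into 576 image-token embeddings that are
# projected into the text embedding space and joined with the text tokens.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```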
Serving Vision-Language Models
Vision-Language Models are designed to process and reason over text and image inputs simultaneously. However, efficiently serving these models in production presents significant challenges, including:
- Large model sizes that demand significant memory and compute resources.
- Latency sensitivity in real-time applications.
- Complex data pipelines for handling and preprocessing multiple modalities.
![Figure 2. Two different pipelines for processing two concurrent images when input text length and output text length are fixed.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F57b96120-5636-452e-b9dc-d48b2cf4a8c8%252Ffixed_output_request.png%3Ftable%3Dblock%26id%3D181258ac-0943-807e-a747-c3358db42032%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
While challenges such as large model sizes and latency sensitivity are similar to those faced by traditional LLMs, complex data pipelines introduce unique difficulties for serving VLMs. For example, Figure 2 illustrates two different scenarios where two concurrent images are processed. Even though the total number of images being processed is the same, how the images are distributed across requests can significantly impact model performance. Thus, VLM serving factors need to be carefully tuned to maximize throughput and minimize latency.
These challenges highlight the importance of high-performance serving frameworks like vLLM and TensorRT-LLM. Both frameworks are optimized for large-scale model inference, offering solutions for handling the computational demands of VLMs while maintaining high performance in production environments.
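As a point of reference, here is a minimal sketch of offline multimodal inference with vLLM. It assumes vLLM's multi_modal_data input format from around the v0.6.x releases and a hypothetical local image file; it is not the benchmarking harness used in these experiments.

```python
# Minimal sketch: serving LLaVA-1.5-7B-HF with vLLM's offline API.
# Assumes vLLM ~v0.6.x; the multimodal input format may differ in other versions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf", dtype="float16")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

image = Image.open("frame_0001.png")  # hypothetical local image file
prompt = "USER: <image>\nDescribe the scene. ASSISTANT:"

# vLLM accepts the image alongside the prompt; the vision encoder runs inside
# the engine and its image tokens are expanded in place of the <image> placeholder.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=sampling,
)
print(outputs[0].outputs[0].text)
```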
Experiment Setup
Model and Hardware Specification
- Model: LLaVA-1.5-7B-HF, FP16
- Hardware: Intel(R) Xeon(R) CPU @ 2.20GHz, 1 x NVIDIA A100-SXM 80G GPU
Benchmark Dataset
- Text Input and Output: Fixed dataset with controlled input and output token lengths
- Image Input: Fixed image size of 336×336 (maximum supported by LLaVA’s vision encoder)
- Request Configuration: Number of images per request varied from 0 to 4, with a fixed batch size of 128 requests for consistency (see the dataset-construction sketch after this list)
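The snippet below sketches how such a controlled dataset could be assembled: synthetic 336×336 images and a crude, word-count-based proxy for input token length. It is illustrative only and not the generator used for the numbers reported here.

```python
# Sketch of the benchmark inputs: fixed-size 336x336 images and a controlled
# number of images per request. Prompt-length control here is a rough proxy.
import random
from PIL import Image

IMAGE_SIZE = (336, 336)   # maximum resolution accepted by LLaVA's vision encoder

def make_image() -> Image.Image:
    """Create a synthetic 336x336 RGB image (a solid random color is enough,
    since compute cost does not depend on image content)."""
    color = tuple(random.randrange(256) for _ in range(3))
    return Image.new("RGB", IMAGE_SIZE, color=color)

def make_request(num_images: int, input_len: int) -> dict:
    """Build one benchmark request with `num_images` images and a text prompt
    whose length is roughly `input_len` tokens (word-count proxy)."""
    placeholders = "<image>\n" * num_images
    filler = " ".join(["hello"] * input_len)   # crude token-length control
    return {
        "prompt": f"USER: {placeholders}{filler} ASSISTANT:",
        "images": [make_image() for _ in range(num_images)],
    }

# 128 requests per run, with 0 to 4 images per request as described above.
batch = [make_request(num_images=2, input_len=128) for _ in range(128)]
```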
Framework Version
- vLLM: v0.6.3
- TensorRT-LLM: v0.15.0 / Triton Server: v2.52.0
Metric Definitions
- Throughput: Number of tokens processed per second
- FPS (Frames Per Second): Defined as FPS = Sequence Throughput × Number of Images per Request
- Max Concurrent Images: Calculated as Max Concurrent Images = Max Concurrent Requests × Number of Images per Request (see the sketch after this list)
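To make the derived metrics concrete, here is a small sketch of how they can be computed, reading sequence throughput as completed requests per second; the function names and example numbers are ours, not from any particular benchmarking tool.

```python
# Sketch of the derived metrics defined above; names and values are illustrative.

def frames_per_second(sequence_throughput: float, images_per_request: int) -> float:
    """FPS = sequence throughput (completed requests/s) x images per request."""
    return sequence_throughput * images_per_request

def max_concurrent_images(max_concurrent_requests: int, images_per_request: int) -> int:
    """Max concurrent images = max concurrent requests x images per request."""
    return max_concurrent_requests * images_per_request

# Example: 16 concurrent requests with 4 images each keep 64 images in flight;
# at 2.5 completed requests/s that corresponds to 10 frames per second.
print(max_concurrent_images(16, 4))   # 64
print(frames_per_second(2.5, 4))      # 10.0
```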
Impact of Image Input
- Dataset Configuration: Input/Output text length pairs of (128, 128), (128, 1024), (1024, 128), and (1024, 1024)
- Image Input: Single image or no image per request
Overview
This experiment evaluates the impact of including image inputs on model serving performance across various input/output text lengths. The objective is to measure the overhead introduced by processing visual data alongside text.
Results
![Figure 3. Throughput with max_concurrency=16 on 4 different datasets with and without image input in vLLM and TensorRT-LLM.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F2696f688-7b96-49fb-835e-5941ab6e1da7%252Fbase_test_throughput.png%3Ftable%3Dblock%26id%3D180258ac-0943-8081-9084-d6fd4b2708c0%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
Figure 3 compares the throughput between requests with and without image inputs. Adding an image significantly degraded throughput across both frameworks. For the (128, 128) configuration, throughput decreased by 37.7% in vLLM and 43.4% in TensorRT-LLM. The degradation is more pronounced for shorter input/output lengths, as image processing introduces a fixed overhead relative to the smaller computational load of shorter texts.
![Figure 4. TTFT with max_concurrency=16 on 4 different datasets with and without image input in vLLM and TensorRT-LLM.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F7211cb8b-43fa-4758-abfb-d8de2a33a3af%252Fbase_test_ttft.png%3Ftable%3Dblock%26id%3D180258ac-0943-802d-9673-cb7edd62d36f%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
Additionally, Figure 4 highlights the noticeable increase in TTFT, driven by two factors: Vision Encoder Processing and Longer Input Lengths. The vision encoder requires extra processing time to convert images into tokens, while longer input lengths increase prefill computation time.
For LLaVA-1.5-7B-HF, a single 336×336 image is converted into 576 image tokens, leading to a 5.5x increase in input length (from 128 tokens to 704 tokens) for datasets with short text inputs. Consequently, TTFT increased by up to 4.28x in vLLM and 5.08x in TensorRT-LLM, closely aligning with the increase in input length. Despite the overall slowdown, vLLM consistently exhibited faster TTFT than TensorRT-LLM across all configurations involving image inputs.
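The arithmetic behind these figures is straightforward, assuming LLaVA-1.5's CLIP ViT-L/14 vision encoder at 336×336 resolution with 14×14-pixel patches:

```python
# Back-of-the-envelope check of the numbers above, assuming LLaVA-1.5's
# CLIP ViT-L/14 encoder at 336x336 resolution with 14x14-pixel patches.
image_size = 336
patch_size = 14
image_tokens = (image_size // patch_size) ** 2        # 24 * 24 = 576

text_tokens = 128
total_input = text_tokens + image_tokens              # 704
print(image_tokens, total_input, total_input / text_tokens)  # 576 704 5.5
```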
Implications
Including even a single image input can significantly impact performance, especially for latency-sensitive applications like VLM chatbots and AI-powered customer support systems. The performance overhead is more severe for tasks involving short text prompts, where image tokens disproportionately increase input length.
User Scenario-based Evaluation
To effectively evaluate the serving performance of vLLM and TensorRT-LLM, we designed experiments that reflect common use cases of Vision-Language Models (VLMs). These scenarios help us analyze how different factors, such as the number of image inputs and the scaling of output complexity, impact key performance metrics like Throughput, FPS, TTFT, and TPOT.
The purpose of these scenarios is to isolate specific aspects of VLM serving performance and identify trade-offs that arise in production environments. By recreating practical workloads, we can provide actionable insights for optimizing serving pipelines based on application requirements.
Each use case addresses a distinct workload:
- Scenario 1 focuses on the impact of varying the number of image inputs while keeping input and output text lengths fixed.
- Scenario 2 examines how performance changes when output complexity scales with the number of image inputs.
By breaking down these scenarios, we aim to highlight the key considerations for choosing the right serving framework and optimizing performance in real-world deployments.
Scenario #1. Fixed Input and Output Text Length
- Dataset Configuration: Varying the number of images per request from 1 to 4, with the input and output text lengths fixed at 128 tokens.
Overview
This scenario examines the performance impact of increasing the number of image inputs while keeping text input and output lengths constant. The goal is to isolate the effect of processing additional images on system performance.
A practical example of this scenario is a CCTV abnormal action detection system, where multiple camera channels are analyzed in parallel to detect suspicious activities (e.g., unauthorized access, accidents, or crowd disturbances). Regardless of the number of camera images being processed, the system produces a fixed-length output like a status alert ("No abnormal action" or "Abnormal action detected") based on a fixed-length input (see Figure 2).
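The two request shapes in Figure 2 can be expressed as OpenAI-compatible chat payloads, such as those accepted by vLLM's API server (a Triton-based TensorRT-LLM deployment would use its own request schema). The sketch below is illustrative only; the endpoint, model name, and image URLs are placeholders rather than the setup used in our experiments.

```python
# Sketch of the two request shapes in Figure 2 as OpenAI-compatible chat payloads.
# The endpoint, model name, and image URLs are hypothetical placeholders.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"   # hypothetical server
IMAGES = ["http://cctv.example/ch1.jpg", "http://cctv.example/ch2.jpg"]

def image_part(url: str) -> dict:
    return {"type": "image_url", "image_url": {"url": url}}

# Option A: one request carrying both images (output length is fixed either way).
packed = {
    "model": "llava-1.5-7b-hf",
    "messages": [{
        "role": "user",
        "content": [{"type": "text", "text": "Report any abnormal action."}]
                   + [image_part(u) for u in IMAGES],
    }],
    "max_tokens": 128,
}

# Option B: two single-image requests sent concurrently.
split = [
    {
        "model": "llava-1.5-7b-hf",
        "messages": [{
            "role": "user",
            "content": [{"type": "text", "text": "Report any abnormal action."},
                        image_part(u)],
        }],
        "max_tokens": 128,
    }
    for u in IMAGES
]

requests.post(ENDPOINT, json=packed, timeout=60)
for body in split:
    requests.post(ENDPOINT, json=body, timeout=60)
```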
Results
![Figure 5. Throughput and TPOT with various max concurrent requests in vLLM (left) and TensorRT-LLM (right). The numbers inside the figure indicate max concurrent requests. Different colors represent different number of input images.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F645a02f7-e1f2-4610-aa21-99864d25459b%252Fscenario1_throughput.png%3Ftable%3Dblock%26id%3D17e258ac-0943-803a-b476-f39a7fb8c014%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
As shown in Figure 5, requests with fewer images achieved better token throughput and faster TPOT. For a maximum of 64 concurrent requests, increasing the number of images from 1 to 4 reduced throughput by 2.89x for vLLM and 3.36x for TensorRT-LLM. This behavior is expected, as processing fewer images reduces the computational load on both the vision encoder and the LLM.
![Figure 6. FPS and TPOT with various max concurrent images in vLLM (left) and TensorRT-LLM (right). The numbers inside the figure indicate max concurrent images. Different colors represent different number of input images.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F58683c05-05e1-4805-b39d-32a7c95abfef%252Fscenario1_fps.png%3Ftable%3Dblock%26id%3D17e258ac-0943-806b-ba07-f19ca934d213%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
To evaluate the model’s efficiency in processing images, we analyzed the FPS metric while varying the maximum number of concurrent images. Requests with more images per request achieved higher FPS and faster TPOT, indicating better overall image processing efficiency. While TensorRT-LLM consistently delivered slightly higher FPS than vLLM, the difference was marginal.
Implications
In scenarios with fixed text input/output lengths, such as CCTV abnormal action detection, placing multiple images in a single request improves overall efficiency compared to handling multiple images with separate requests. Although single-image requests maximize token throughput, handling multiple images in a single request leads to better FPS, enabling faster analysis of large workloads.
For production systems that handle high volumes of image data, optimizing requests to include multiple images significantly improves processing speed and system performance. This approach is particularly beneficial when text inputs and outputs can be efficiently formatted regardless of the number of images per request.
Scenario #2: Scaling Output Complexity with Image Count
- Dataset Configuration: Output text length increases with the number of images in each request:
- (1 image, 128 tokens), (2 images, 256 tokens), (3 images, 384 tokens), (4 images, 512 tokens)
- Input Text Length: Fixed at 128 tokens
Overview
![Figure 7. Two different pipelines for processing two concurrent images when output text length scales with the number of images in the request.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F7654cc0c-f6ba-477b-b401-54f4e3deb778%252Fdynamic_output_request.png%3Ftable%3Dblock%26id%3D181258ac-0943-8049-9418-fefe407d8606%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
This scenario explores how serving performance is affected when output text length scales proportionally with the number of images in a request, as shown in Figure 7. This pattern is common in applications like multi-image storytelling or automated report generation, where each additional image requires more detailed output.
Results
![Figure 8. FPS and TPOT with various max concurrent images in vLLM (left) and TensorRT-LLM (right). The numbers inside the figure indicate max concurrent images. Different colors represent different number of input images.](https://inblog.ai/_next/image?url=https%3A%2F%2Fwww.notion.so%2Fimage%2Fhttps%253A%252F%252Fprod-files-secure.s3.us-west-2.amazonaws.com%252F23f4b38d-2def-440d-b962-b485f3d7fb97%252F30d790dd-76ad-4b9e-a992-5191b9fde38f%252Fscenario3_fps.png%3Ftable%3Dblock%26id%3D17e258ac-0943-8050-99c4-e71ce58c273a%26cache%3Dv2&w=3840&q=75&dpl=dpl_9UuMoa33DDBzhSnD7HcqXVLwvJN3)
As illustrated in Figure 8, requests with fewer images consistently achieved higher FPS across both frameworks. This is due to the batch size and auto-regressive decoding nature of LLMs, where outputs are generated token by token.
For instance, with a maximum of 4 concurrent images, processing four separate requests in parallel (each with 1 image and a 128-token output) completes in 128 decoding steps. Conversely, a single request with 4 image inputs and a 512-token output must complete 512 sequential decoding steps, resulting in significantly slower processing.
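A back-of-the-envelope sketch of that decoding-step arithmetic:

```python
# Rough decoding-step count for the two strategies described above,
# under a budget of 4 concurrent images and 128 output tokens per image.
images_total = 4
tokens_per_image = 128

# Four parallel 1-image requests: each needs 128 steps, run side by side.
steps_split = tokens_per_image                    # 128 sequential steps

# One 4-image request with a 512-token output: every step is sequential.
steps_packed = images_total * tokens_per_image    # 512 sequential steps

print(steps_split, steps_packed)                  # 128 512
```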
Implications
When output length scales with the number of images, it is more efficient to process images individually rather than combining them into a single request. Under a fixed maximum concurrent image budget, splitting larger requests into smaller ones allows the model to use larger batch sizes, reducing the number of sequential decoding steps and improving FPS.
For production workloads like dynamic report generation, optimizing the serving pipeline by splitting requests can significantly enhance performance and reduce processing time.
Conclusion
Efficiently serving Vision-Language Models (VLMs) like LLaVA-1.5-7B-HF is essential as they drive multimodal applications such as AI assistants, visual search, and dynamic reporting. This evaluation of vLLM and TensorRT-LLM highlights key trade-offs in throughput, latency, and scalability. Including image inputs increases input length and latency, especially for short text prompts, requiring optimized serving pipelines for latency-sensitive applications like VLM chatbots.
For fixed input and output lengths, batching multiple images in a single request improves FPS, making it more efficient for high-throughput tasks such as CCTV monitoring. Conversely, when output complexity scales with the number of images, splitting requests into smaller batches reduces decoding steps and improves FPS, which benefits use cases like storytelling and report generation.
While TensorRT-LLM excels in throughput and processing efficiency, vLLM consistently delivers faster TTFT with image inputs, making it better suited for latency-critical scenarios. Selecting the right framework depends on specific deployment needs, but achieving optimal performance ultimately requires a well-tuned serving strategy.
If you are interested in comparing different serving frameworks with different workloads, take a look at our LLM serving benchmark tool, Fits on Chips!