How to Quantize Transformer-based Models for TensorRT Deployment

This article presents experimental results from quantizing the Vision Transformer and its variants with OwLite.
Daehyun Ahn
May 20, 2025

Introduction

In recent years, Transformer-based models have become central to the advancement of deep learning. Their remarkable performance on various tasks has led to widespread adoption across domains such as natural language processing (NLP) and computer vision (CV). In NLP, the shift from encoder-decoder models to encoder-only (e.g., BERT) and decoder-only (e.g., GPT) variants, along with enhancements like Mixture-of-Experts (MoE) and Rotary Position Embedding (RoPE), has significantly improved model performance. Similarly, in CV, architectures like the Swin Transformer have extended the basic encoder-only structure of ViT through hierarchical and window-based mechanisms. These developments have made Transformers the new standard in deep learning across both language and vision tasks.
However, the high computational cost and architectural complexity of Transformer-based models limit their practicality for real-time applications or deployment on resource-constrained devices. In particular, complex architectural components such as multi-head attention and RoPE, often introduced to boost model performance, make it difficult for frameworks like TensorRT to automatically identify operations for quantization or to apply fused INT8-optimized kernels. As a result, these models struggle to achieve effective accuracy-latency trade-offs through standard quantization workflows.
To address these challenges, OwLite offers a robust solution for compressing and accelerating Transformer models with minimal loss in accuracy. In this blog, we explore how INT8 quantization—one of the most effective and widely supported compression techniques—can be applied to various Transformer-based models for TensorRT deployment, and how OwLite achieves superior results compared to native TensorRT INT8 workflows.

Quantization of Vision Transformer (ViT)

Architecture Overview

Figure 1. The model architecture of the Vision Transformer (Reference: A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” ICLR 2021).
The Vision Transformer (ViT) adapts the Transformer architecture, originally developed for natural language processing, to the domain of computer vision by dividing images into fixed-size patches and treating them as sequential tokens. As illustrated in Figure 1, each patch is flattened into a one-dimensional vector and augmented with positional encodings before being processed by standard Transformer encoder blocks, which consist of multi-head self-attention and MLP layers. This architecture enables ViT to capture global visual relationships more effectively than traditional convolutional neural networks (CNNs), particularly when trained on large-scale datasets.
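To make the patch-token pipeline concrete, the sketch below shows a minimal patch embedding and a single encoder block in PyTorch. It is only an illustrative outline of the structure described above; the layer names and hyperparameters (patch size 16, embedding dimension 768, 12 heads) follow the ViT-Base configuration but are not the exact implementation benchmarked here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)         # (B, N, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # prepend the class token
        return torch.cat([cls, x], dim=1) + self.pos_embed  # add positional encodings

class EncoderBlock(nn.Module):
    """Standard pre-norm Transformer encoder block: multi-head self-attention + MLP."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # self-attention branch
        return x + self.mlp(self.norm2(x))                  # feed-forward branch

images = torch.randn(2, 3, 224, 224)
tokens = PatchEmbed()(images)                # (2, 197, 768): 196 patches + class token
print(EncoderBlock()(tokens).shape)          # torch.Size([2, 197, 768])
```

Stacking a dozen such blocks over roughly 197 tokens is exactly the structure that INT8 tooling must recognize and fuse for quantization to pay off.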

Using TensorRT native Quantization

Table 1. ViT compression results on A6000 using TensorRT and OwLite.
TensorRT is widely known for optimizing model inference on GPUs, particularly for conventional CV architectures. To evaluate its performance on Transformer-based models, we conducted a comparative study using OwLite. Table 1 summarizes the quantization results for ViT in terms of accuracy and latency.
When INT8 quantization was applied with TensorRT, the resulting model exhibited no change in either latency or accuracy, indicating that quantization was not effectively applied. This suggests that native TensorRT quantization is incompatible with the ViT architecture.
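For reference, the native TensorRT INT8 path used in this comparison corresponds to enabling the INT8 builder flag on the exported ONNX model. The snippet below is a minimal sketch of that workflow using the TensorRT 8.x-style Python API; the file names and the commented-out calibrator are placeholders rather than the exact scripts used in our experiments.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("vit.onnx", "rb") as f:            # placeholder path to the exported model
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)        # allow FP16 kernels as a fallback
config.set_flag(trt.BuilderFlag.INT8)        # enable native INT8 quantization
# config.int8_calibrator = MyCalibrator(...) # an IInt8EntropyCalibrator2 subclass

engine_bytes = builder.build_serialized_network(network, config)
with open("vit_int8.engine", "wb") as f:
    f.write(engine_bytes)
```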
In contrast, OwLite successfully quantized the ViT model, achieving a 30% reduction in latency with only a 0.7% drop in accuracy. These results demonstrate that OwLite is capable of delivering real quantization benefits on Transformer-based vision models, where native TensorRT fails to apply effective compression.

OwLite’s Recommendation for Vision Transformer model

Figure 2. Part of quantization configuration of ViT model generated by OwLite’s Recommendation (a) and applied to all nodes naively (b).
Although TensorRT failed to quantize ViT effectively, OwLite’s recommendation system identified an optimal configuration (Figure 2a), reducing model size by nearly 50% and latency by ~30%. In contrast, brute-force quantization that applies INT8 to all nodes indiscriminately (Figure 2b) actually increased latency by 1.3× compared to the recommended setup, due to the loss of kernel fusion opportunities.
OwLite’s Recommendation feature automatically searches for the optimal configuration, allowing users to achieve the best compression-performance trade-off with minimal manual tuning.
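For readers who want to reproduce this kind of experiment, the sketch below outlines the typical OwLite Python workflow: convert the model under an experiment that carries the recommended configuration, run a short calibration pass, then export and benchmark. The project, baseline, and experiment names, the torchvision checkpoint, and the random calibration batches are placeholders, the recommended configuration itself is created in the OwLite web UI, and exact arguments may differ slightly across OwLite releases.

```python
import owlite
import torch
from torchvision.models import vit_b_16

# Placeholders: a pretrained ViT-B/16 and a handful of stand-in calibration batches.
model = vit_b_16(weights="IMAGENET1K_V1").cuda().eval()
calib_batches = [torch.randn(8, 3, 224, 224) for _ in range(16)]

# Attach to an OwLite project; the experiment holds the (recommended) quantization config.
owl = owlite.init(project="vit-compression", baseline="vit-b16", experiment="int8-recommended")

# Convert the model according to the experiment's quantization configuration.
model = owl.convert(model, torch.randn(1, 3, 224, 224).cuda())

# Post-training calibration: run representative data through the converted model.
with owlite.calibrate(model) as calibrated_model:
    for images in calib_batches:
        calibrated_model(images.cuda())

owl.export(model)   # export the quantized model for TensorRT deployment
owl.benchmark()     # build a TensorRT engine and report latency
```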

Quantization of ViT Variants

1. CaiT: A Deeper Vision Transformer

Class-Attention in Image Transformers (CaiT) builds on ViT by introducing two key enhancements:
  • A Class-Attention Layer that improves interaction between the class token and image patches for more accurate classification.
  • A Layer Scaling mechanism that stabilizes training in deep networks, enabling better performance at greater depth.
These additions allow CaiT to capture global context more effectively than ViT while maintaining computational efficiency.
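The sketch below illustrates both ideas in PyTorch: a LayerScale module that gates each residual branch with near-zero learnable weights, and a class-attention layer in which only the class token queries the patch tokens. It is a simplified illustration of the mechanisms, not the exact CaiT implementation that was benchmarked.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Learnable per-channel scaling on a residual branch, initialized near zero
    so deep stacks of blocks start close to identity and train stably."""
    def __init__(self, dim, init_value=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):
        return x * self.gamma

class ClassAttention(nn.Module):
    """Attention in which only the class token attends to the patch tokens,
    concentrating classification-specific interaction in a single cheap query."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ls = LayerScale(dim)

    def forward(self, x):                     # x: (B, 1 + N, dim), class token first
        cls, patches = x[:, :1], x[:, 1:]
        out, _ = self.attn(cls, x, x, need_weights=False)
        cls = cls + self.ls(out)              # LayerScale-gated residual update
        return torch.cat([cls, patches], dim=1)

tokens = torch.randn(2, 197, 768)             # e.g. 196 patch tokens + 1 class token
print(ClassAttention()(tokens).shape)         # torch.Size([2, 197, 768])
```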
Table 2. CaiT compression results on A6000 using TensorRT and OwLite.
As with ViT, TensorRT INT8 quantization had no measurable effect on CaiT's performance. In contrast, OwLite effectively quantized the CaiT model, achieving a 20% reduction in latency with only a 0.6% drop in accuracy (see Table 2). This result underscores OwLite’s ability to scale its quantization capabilities to deeper and more complex Transformer architectures.

2. Swin Transformer

The Swin Transformer addresses ViT’s inefficiencies by introducing a window-based attention mechanism and hierarchical feature processing. Instead of global self-attention, Swin restricts attention to non-overlapping windows, significantly reducing computation. As the network progresses, it merges patches to build multi-scale representations, improving performance in tasks like classification, detection, and segmentation.
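The window partitioning at the heart of this design is easy to sketch. The helper below is a simplified stand-in for the corresponding routine in typical Swin implementations: it reshapes a feature map into non-overlapping 7x7 windows so that self-attention cost grows with window area rather than image area.

```python
import torch

def window_partition(x, window_size=7):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C) so attention runs per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Example: a 56x56 stage-1 feature map becomes 64 windows of 49 tokens each,
# instead of a single global attention over 3136 tokens.
feat = torch.randn(1, 56, 56, 96)
print(window_partition(feat).shape)  # torch.Size([64, 49, 96])
```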
Table 3. Swin compression results on A6000 using TensorRT and OwLite.
Consistent with previous models, TensorRT failed to apply effective quantization to Swin-B. OwLite, however, successfully quantized the model, reducing latency by 15% with just a 0.2% drop in accuracy, as shown in Table 3. Despite smaller gains than with ViT, these results confirm OwLite’s compatibility with hierarchical Transformer architectures.

3. EVA-02: An Advanced Derivative of Swin-B

EVA-02 is a high-performance extension of Swin-B, maintaining its window-based attention and hierarchical design while leveraging large-scale pretraining to improve generalization. It builds directly on the Swin-B backbone but aims for greater transferability and robustness.
The original ONNX graph of EVA-02 included an unsupported SplitToSequence operation, which caused compatibility issues with TensorRT. We replaced this operation with an equivalent TensorRT-supported one, enabling successful inference and quantization with OwLite.
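We do not reproduce the exact graph surgery here, but the general pattern is easy to illustrate: list-producing splits in PyTorch (for example torch.unbind or tensor.split) can be exported as SplitToSequence depending on the opset and tracing path, and rewriting them as static indexing yields plain Slice/Gather nodes that TensorRT handles. The function names and tensor shapes below are hypothetical, chosen only to show that the two forms are numerically equivalent.

```python
import torch

def split_as_sequence(qkv):
    # List-producing split; may export to ONNX as SplitToSequence.
    return torch.unbind(qkv, dim=0)

def split_as_static_index(qkv):
    # Equivalent rewrite with static indexing; exports as plain Slice/Gather ops.
    return qkv[0], qkv[1], qkv[2]

qkv = torch.randn(3, 8, 197, 64)  # hypothetical (qkv, heads, tokens, head_dim) layout
assert all(torch.equal(a, b)
           for a, b in zip(split_as_sequence(qkv), split_as_static_index(qkv)))
```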
Table 4. EVA-02 compression results on A6000 using TensorRT and OwLite.
As shown in Table 4, TensorRT once again failed to apply effective quantization, with no measurable changes observed across any metrics. On the other hand, OwLite successfully quantized EVA-02, reducing latency by 24% while preserving accuracy within 0.8% of the FP16 baseline. This demonstrates OwLite’s flexibility in adapting to state-of-the-art model variants with minor preprocessing.

Conclusion

Across all four Transformer-based models—ViT, CaiT, Swin-B, and EVA-02—OwLite consistently reduced latency by 15–30% while keeping accuracy loss within 1%. These results strongly demonstrate that OwLite is a best-in-class solution for INT8 quantization of Transformer architectures.
While frameworks like TensorRT often fail to apply INT8 quantization effectively to Transformer models due to architectural complexity, OwLite succeeds in both compressing and accelerating these models without sacrificing performance. This includes deep models (CaiT), hierarchical designs (Swin-B), and their cutting-edge variants (EVA-02).
Thanks to OwLite, real-time deployment of large-scale Transformer models is now practical, even on edge devices and in latency-critical applications. If you're working with Transformer models and need to optimize them for speed and efficiency, OwLite is the tool to get the job done.
👉 Explore OwLite and start transforming your AI deployment pipeline today.