How to Quantize YOLO models with OwLite

This article presents experimental results from quantizing YOLO models with OwLite.
Daehyun Ahn
May 07, 2025

Introduction

Recent advances in AI models have expanded their applicability across a wide range of tasks, leading to a rapid increase in model complexity and size. As models grow larger, deploying them efficiently in real-world scenarios—especially in latency-sensitive tasks like real-time object detection—has become increasingly challenging. To meet strict latency and memory constraints in such scenarios, model compression has become a crucial requirement for practical deployment. Among various compression techniques, quantization, particularly INT8 quantization, has emerged as one of the most widely used methods. INT8 quantization reduces model size and computational load while maintaining competitive model performance. Moreover, because INT8 arithmetic units are natively supported on a wide range of hardware platforms, including GPUs, NPUs, and edge devices, INT8 quantization is highly practical and broadly applicable for real-world deployments of AI models.
Despite its effectiveness, the application of INT8 quantization generally requires specialized expertise, making it challenging for non-experts to implement successfully. To address this barrier, SqueezeBits’ OwLite offers a user-friendly toolkit that enables AI model compression through an intuitive interface. OwLite further lowers the entry threshold by providing a “Recommended Settings” feature, which encapsulates the best practices derived from extensive model compression experience.
One of the central challenges in model quantization is maintaining model accuracy while reducing size and latency. Users may need to iteratively tune PTQ (Post-Training Quantization) calibration options to achieve optimal results or, in certain cases, apply Quantization-Aware Training (QAT) for further improvements. OwLite provides strong support for both approaches, offering additional calibration methods beyond those available in TensorRT for finer control, as well as a full QAT workflow to help preserve FP16-level accuracy even after INT8 quantization. These features make OwLite especially well-suited for compressing real-time object detection models such as YOLO.
To assess the effectiveness of OwLite, we conducted a series of compression experiments on YOLO models, comparing the performance of the compressed models to their original versions. Using both the TensorRT and OwLite toolkits, we aimed to identify and analyze the optimal compression strategies.

How does INT8 quantization differ between TensorRT and OwLite?

PTQ Calibration Methods

One of the most important factors for preventing accuracy degradation during quantization is determining the optimal step size for each node to be quantized. During post-training quantization, the calibration step analyzes the distributions of weights and activations on actual data to accurately compute the step size for each quantized node. Since the computed step size varies depending on the calibration method, choosing the appropriate calibration method is essential to achieve the highest post-quantization accuracy.
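To make the role of the step size concrete, here is a minimal sketch of uniform symmetric INT8 fake quantization in PyTorch. It illustrates the general technique only; the function name and details are ours, not OwLite's internals.

```python
import torch

def fake_quantize_int8(x: torch.Tensor, step_size: float) -> torch.Tensor:
    """Quantize x to INT8 with the given step size, then dequantize.

    The step size maps real values onto the 256 integer levels of INT8.
    Too small a step size clips outliers; too large a step size wastes
    resolution on values that never occur.
    """
    q = torch.clamp(torch.round(x / step_size), -128, 127)  # quantize
    return q * step_size                                    # dequantize
```

Calibration amounts to choosing `step_size` so that the rounding and clipping errors introduced above are minimal for the data the model actually sees.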
TensorRT provides several calibrators for INT8 quantization, including Legacy, Entropy, Entropy2 (default), and MinMax. In contrast, OwLite extends these options by supporting additional methods such as Percentile and MSE, as shown in Figure 1. The MSE method computes the step size that minimizes the mean squared error between the original and dequantized values. The Percentile method sets the range by clipping the bottom and top (100-N)% of tensor values (N = 99.99 by default). In addition, OwLite allows setting a different calibration method per layer, enabling finer-grained PTQ. For example, you can apply the MSE calibrator by default, but selectively use the Percentile or MinMax methods for specific layers where outlier weights or activations are critical. This per-node calibration flexibility offers further opportunities to improve model accuracy. By offering calibration options not available in TensorRT, OwLite enables more effective and fine-grained quantization, leading to better trade-offs between speed and accuracy.
Figure 1. The supported PTQ calibration methods (Left) and QAT backward methods (Right) in OwLite.
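To make the difference between the two calibrators concrete, the sketch below derives a step size from a calibration tensor with a simple MSE grid search and with percentile clipping, reusing the hypothetical `fake_quantize_int8` helper from above. This is a conceptual approximation, not OwLite's actual implementation.

```python
import torch

def mse_step_size(x: torch.Tensor, num_candidates: int = 100) -> float:
    """Grid-search the clipping range that minimizes quantization MSE."""
    max_abs = x.abs().max().item()
    best_step, best_err = max_abs / 127, float("inf")
    for i in range(1, num_candidates + 1):
        step = (max_abs * i / num_candidates) / 127  # candidate clipping range
        err = torch.mean((x - fake_quantize_int8(x, step)) ** 2).item()
        if err < best_err:
            best_step, best_err = step, err
    return best_step

def percentile_step_size(x: torch.Tensor, n: float = 99.99) -> float:
    """Symmetric variant: clip at the n-th percentile of absolute values."""
    clip = torch.quantile(x.abs().flatten().float(), n / 100).item()
    return clip / 127
```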
To evaluate the impact of different PTQ calibrators on model accuracy, we conducted a quantization experiment using YOLOv7, applying each calibrator supported by TensorRT and OwLite. As shown in Table 1, OwLite’s MSE and Percentile calibration methods achieved higher mAP performance compared to TensorRT’s MinMax and Entropy2 methods, while maintaining the acceleration benefits. The best quantized model was obtained using OwLite’s MSE calibration, achieving only a 0.5% decrease in mAP compared to the TensorRT FP16 model, along with approximately a 2x speed-up.
In summary, by supporting advanced calibration options such as MSE and Percentile and allowing per-node customization, OwLite provides an environment for achieving more optimal quantized models.
Table 1. YOLOv7 quantization results with different calibration methods. The Legacy and Entropy calibrators in TensorRT could not be applied to YOLOv7 due to unexpected errors.

Quantization-Aware Training (QAT)

When PTQ does not fully meet the expected accuracy of the model, developers may consider Quantization-Aware Training (QAT) to further improve performance. While TensorRT does not support QAT, OwLite provides a complete QAT workflow by enabling developers to finetune the quantized PyTorch models that OwLite produces. As shown in Figure 1 (Right), developers can choose between two QAT backward methods—STE and CLQ—and configure the gradient scale value per node. By experimenting with different QAT configurations and training hyperparameters such as learning rate, weight decay, and the number of epochs, developers can significantly enhance model accuracy after quantization.
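To illustrate what a QAT backward method does, here is a textbook straight-through estimator (STE) written as a PyTorch autograd function: the forward pass applies fake quantization, and the backward pass propagates gradients through the non-differentiable rounding as if it were the identity. This generic sketch is not OwLite's implementation, and CLQ handles gradients for clipped values differently; see the documentation for specifics.

```python
import torch

class STEFakeQuantize(torch.autograd.Function):
    """Fake INT8 quantization with a straight-through estimator backward."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, step_size: float) -> torch.Tensor:
        # Quantize-dequantize so training sees the same rounding/clipping
        # error that will occur at INT8 inference time.
        q = torch.clamp(torch.round(x / step_size), -128, 127)
        return q * step_size

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # STE: round/clamp have zero gradient almost everywhere, so pass
        # the incoming gradient straight through to keep training alive.
        return grad_output, None

# Usage: y = STEFakeQuantize.apply(weight, 0.01)
```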
Further details on OwLite’s supported PTQ calibration and QAT backward methods are available in the official documentation.

Experimental Results

Experimental setup

We evaluated OwLite's performance across YOLO-family models (YOLOv5 through YOLOv8, and YOLOX) under four quantization settings: TensorRT-FP16, TensorRT-INT8, OwLite-INT8-PTQ, and, where applicable, OwLite-INT8-QAT.
The TensorRT-FP16 and TensorRT-INT8 models were built with the TensorRT Polygraphy toolkit with the fp16 and/or int8 options enabled. The ENTROPY_CALIBRATION_2 calibrator was used by default, except for the YOLOv6 family, where MIN_MAX_CALIBRATION was selected due to its superior mAP. For OwLite INT8 quantization, we used OwLite's recommended compression setting: OwLite automatically analyzes the model architecture and selects the nodes to be quantized, optimizing for both latency and memory requirements. MSE calibration was used for PTQ (except for YOLOv6, where Percentile (99.9%) was applied), and CLQ was applied for QAT. The environment used for the experiments is as follows.
OwLite version: 2.1.0
PyTorch: 2.1.2
Python: 3.10
TensorRT Evaluation GPU: A6000
Dataset: COCO'17
Image size: 640x640
Calibration images: 256
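For reference, a TensorRT baseline like the ones above can be built with Polygraphy's Python API along the following lines. The file names, input name, and data loader are placeholders, and details may differ across Polygraphy versions.

```python
import numpy as np
import tensorrt as trt
from polygraphy.backend.trt import (
    Calibrator, CreateConfig, EngineFromNetwork, NetworkFromOnnxPath, SaveEngine,
)

def calib_data(num_batches: int = 8):
    # Placeholder loader: real calibration should feed preprocessed 640x640
    # COCO images, matching the inference-time input layout.
    for _ in range(num_batches):
        yield {"images": np.random.rand(32, 3, 640, 640).astype(np.float32)}

# MinMax calibrator as used for the YOLOv6 family; drop BaseClass to fall
# back to TensorRT's default Entropy2 calibrator.
calibrator = Calibrator(data_loader=calib_data(), BaseClass=trt.IInt8MinMaxCalibrator)

build_engine = SaveEngine(
    EngineFromNetwork(
        NetworkFromOnnxPath("yolo.onnx"),  # placeholder ONNX export
        config=CreateConfig(fp16=True, int8=True, calibrator=calibrator),
    ),
    path="yolo_int8.engine",
)
engine = build_engine()  # Polygraphy loaders are lazy: calling builds and saves
```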
A detailed tutorial and usage guidelines for OwLite can be found in our blog and official documentation.

YOLOv5

Figure 2. TensorRT Latency vs. COCO17 mAP[0.5:0.95] of YOLOv5-family models and their quantized versions on A6000 with batch size = 32.
For the YOLOv5 models, we compared the performance of four compressed variants, TensorRT-FP16, TensorRT-INT8, OwLite-INT8-PTQ, and OwLite-INT8-QAT, as shown in Figure 2. OwLite-INT8-QAT was finetuned from OwLite-INT8-PTQ with learning_rate=1e-4 and weight_decay=1e-4 for 4 epochs.
OwLite-INT8-QAT achieved the best trade-off between mAP and latency among the quantized models, delivering around a 2x speed-up over TensorRT-FP16 while maintaining mAP. While the TensorRT-INT8 models showed latency similar to the OwLite-INT8-{PTQ, QAT} models, their mAP was 5-10% lower. OwLite's MSE PTQ method can fully recover the mAP, and its QAT support further boosted mAP compared to PTQ alone.
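For readers who want to reproduce this step, QAT finetuning is an ordinary PyTorch training loop over the fake-quantized model. The sketch below uses the hyperparameters reported above (learning rate 1e-4, weight decay 1e-4, 4 epochs); `quantized_model`, `train_loader`, and `compute_loss` are placeholders for the actual YOLOv5 training pipeline, and the optimizer choice is our assumption.

```python
import torch

def qat_finetune(quantized_model, train_loader, compute_loss, epochs=4):
    """Finetune a fake-quantized model with the hyperparameters used here."""
    # SGD is illustrative; the experiment only reports lr and weight decay.
    optimizer = torch.optim.SGD(
        quantized_model.parameters(), lr=1e-4, weight_decay=1e-4
    )
    for _ in range(epochs):
        for images, targets in train_loader:
            preds = quantized_model(images)      # forward through fake-quant ops
            loss = compute_loss(preds, targets)  # standard detection loss
            optimizer.zero_grad()
            loss.backward()                      # gradients flow via STE/CLQ
            optimizer.step()
```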

YOLOv6

Figure 3. TensorRT Latency vs. COCO17 mAP[0.5:0.95] of YOLOv6-family models and their quantized versions on A6000 with batch size = 32.
Figure 3 shows the latency and mAP of YOLOv6-family models and their quantized versions on the A6000. Both TensorRT-INT8 and OwLite-INT8-PTQ models achieved approximately 2x lower latency than the FP16 models, while OwLite-INT8-PTQ delivered slightly better mAP than TensorRT-INT8.
OwLite currently does not support QAT for YOLOv6 models due to the reparameterization technique used during their training. Nonetheless, OwLite’s percentile PTQ method successfully recovered mAP while maintaining fast inference.

YOLOv7

Figure 4. TensorRT Latency vs. COCO17 mAP[0.5:0.95] of YOLOv7-family models and their quantized versions on A6000 with batch size = 16.
Figure 4 shows the latency vs. mAP trade-offs of the YOLOv7 and YOLOv7-X models and their quantized versions. TensorRT-INT8 quantization caused up to a 13% drop in mAP for YOLOv7 and 5% for YOLOv7-X, whereas OwLite's MSE-PTQ calibration maintained near-FP16 mAP with a roughly 2x speed-up.
As with YOLOv6, the reparameterization technique used in training YOLOv7 models limits the applicability of QAT, but OwLite's MSE PTQ was sufficient to preserve the performance of the models.

YOLOv8

Figure 5. TensorRT Latency vs. COCO17 mAP[0.5:0.95] of YOLOv8-family models and their quantized versions on A6000 with batch size = 32.
Figure 5 describes the experimental results for the YOLOv8 family, comparing the TensorRT-FP16, TensorRT-INT8, and OwLite-INT8-PTQ versions. Across all model sizes from nano to xlarge, OwLite-INT8-PTQ showed only 1.2-1.6% accuracy degradation, outperforming the TensorRT-INT8 models, which suffered 2.5-3% degradation at similar latency.

YOLOX

Figure 6. TensorRT Latency vs. COCO17 mAP[0.5:0.95] of YOLOX-family models and their quantized versions on A6000 with batch size = 16.
Figure 6 presents performance comparisons for the YOLOX family. TensorRT-INT8 models experienced about a 3% mAP drop, whereas OwLite-PTQ reduced the accuracy gap to 0.3-0.6% compared to their FP16 baselines.
Across YOLOv5 to YOLOX models, OwLite consistently outperformed TensorRT-INT8 quantization in terms of maintaining mAP while achieving similar speed-up. OwLite’s MSE and Percentile calibration methods played a critical role in minimizing accuracy degradation. Furthermore, the availability of QAT support in OwLite enabled additional mAP gains, especially for YOLOv5 models.
Detailed experimental results on YOLO quantization can be found in the owlite-examples repository. By following the example code in this repository, you can also experience the improved performance of YOLO models, and even other vision models, with OwLite's quantization.

Conclusion

Object detection models like YOLO are often deployed in real-time applications, where low latency and high throughput are non-negotiable. To meet these demands, INT8 quantization has become a critical technique for reducing model size and inference time. However, implementing INT8 quantization typically requires deep understanding of both model internals and calibration mechanics.
OwLite addresses this challenge by offering a streamlined, accessible, and high-performance quantization toolkit. It supports advanced calibration options—MSE and Percentile—that are not available in the TensorRT framework, enabling models to maintain higher accuracy after compression. Furthermore, OwLite uniquely supports Quantization-Aware Training (QAT), allowing developers to finetune quantized models for even greater performance.
Our experiments across YOLOv5 to YOLOX consistently demonstrated that models quantized with OwLite achieved superior mAP and similar latency compared to those compressed using TensorRT. In particular, OwLite’s QAT support for YOLOv5 and advanced calibration for YOLOv7, v8 and X led to minimal accuracy loss while significantly reducing inference time.
Beyond performance, OwLite provides a user-friendly interface that lowers the barrier to entry for quantization. Users can apply quantization without writing or modifying complex code, and can easily select which nodes to quantize for customized or recommended compression. This makes OwLite suitable not only for experienced ML engineers, but also for developers new to model optimization.
In brief, OwLite empowers users to build lightweight, high-speed, and low-power AI models without sacrificing accuracy or requiring deep technical knowledge. By making advanced quantization techniques both accessible and effective, OwLite enables broader and more efficient deployment of AI models across real-world applications.
Ready to experience fast, accurate, and efficient AI deployment?
👉 Try OwLite and unlock the full potential of quantization.
 