OwLite Meets Qualcomm Neural Network: Unlocking On-Device AI Performance
Jul 03, 2025
Introduction
AI is transforming not only cloud computing but also the countless edge devices around us: smartphones, tablets, IoT devices, and more. On-device AI offers compelling advantages: no network dependency, enhanced privacy protection, and reduced latency. This has driven widespread adoption of AI models on edge devices.
Developers have strived to run AI models on edge devices using various SoCs (Systems on Chip) such as Samsung's Exynos and Apple Silicon. Among these, Qualcomm's solutions, Snapdragon and Dragonwing, are widely used across diverse hardware platforms including mobile, automotive, and IoT. Qualcomm supports developers' on-device AI implementation through a comprehensive AI software stack, including the Qualcomm Neural Processing SDK and Qualcomm AI Hub.
However, edge devices face inherent constraints: limited computing power, battery life concerns, and thermal restrictions. These factors create significant challenges when deploying high-performance AI models, making optimization crucial for developers seeking peak performance within these constraints.
At SqueezeBits, we have been empowering developers to efficiently deploy complex AI models while minimizing performance trade-offs with the OwLite toolkit. With OwLite v2.5, we're excited to announce official support for Qualcomm Neural Network (QNN) through seamless integration with Qualcomm AI Hub. This integration enables developers to easily quantize and deploy models not only on GPUs but across a wide range of Qualcomm edge devices, unlocking the full potential of Qualcomm's specialized hardware accelerators.
Understanding Qualcomm's AI Ecosystem
What is Qualcomm Neural Network (QNN)?
QNN is Qualcomm’s cross-platform neural network runtime and graph format that converts trained models into hardware-optimized executables and dynamically schedules them across Qualcomm hardware. QNN, through the Qualcomm AI Stack, supports model execution on various processing engines including the Snapdragon CPU, Adreno GPU, and Hexagon DSP. In particular, by leveraging specialized processing units like the Hexagon DSP (NPU), QNN delivers significantly improved power efficiency and inference speed compared to CPU or GPU-only execution. This translates to faster AI experiences with longer battery life—essential for mobile and IoT applications.
Qualcomm AI Hub: The Control Tower of the Qualcomm AI Ecosystem
Qualcomm AI Hub serves as a unified platform that simplifies AI application development and deployment on Qualcomm hardware. It seamlessly converts PyTorch and ONNX models into on-device optimized formats including QNN, ONNX Runtime, and LiteRT.
Beyond conversion, the platform provides access to Qualcomm-hosted devices for real-world benchmarking and inference testing. This cloud-based testing environment allows developers to experiment with and validate their models on actual Qualcomm hardware before deployment, streamlining the development-to-production pipeline.
OwLite + Qualcomm: A Powerful Integration
Now integrated with Qualcomm AI Hub, OwLite empowers developers to effortlessly optimize AI models using its user-friendly interface and powerful optimization features, and to deploy those models with quantization techniques that preserve the maximum performance of Qualcomm's neural engines.
Seamless Setup Process
OwLite supports QNN with great simplicity. Users can connect to the installed OwLite Qualcomm Runner via the owlite device connect command and then simply specify the target Qualcomm device name in their existing OwLite-integrated code. You can find a list of supported devices in the patch notes.

import owlite
# When initializing OwLite, specify the target Qualcomm device name in the device parameter
owl = owlite.init(
    project="<your_project_name>",
    baseline="<your_baseline_name>",
    experiment="<your_experiment_name>",
    device="<target_Qualcomm_device_name>",  # e.g., "Samsung Galaxy S24 Ultra"
)
Subsequently, all of OwLite's powerful features, such as model visualization, fine-grained customization, and built-in latency benchmarking, can be utilized seamlessly in the QNN environment.
OwLite's Tailored Recommendations for QNN
OwLite provides QNN-specific optimization recommendations tailored to Qualcomm hardware characteristics. Because TensorRT and QNN differ in the quantized kernel patterns they support, each calls for a distinct quantization strategy. OwLite accounts for these differences and recommends the most suitable quantization settings for QNN models, helping users reach optimal performance without lengthy trial and error.
Our expert-curated recommendations leverage deep understanding of QNN's execution patterns, automatically selecting quantization schemes that balance accuracy and performance for your specific target device.

Bridging the Performance Gap
While the Qualcomm AI Hub is a very convenient platform, it also has some limitations. Notably, it only supports quantization for ONNX models, and the supported quantization methods are limited to basic PTQ techniques like Min-Max and MSE. As models become more complex, these simpler methods may not suffice for performance recovery. This is precisely where OwLite can effectively fill the gap through various PTQ and QAT algorithms.
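To make the limitation concrete, here is a self-contained NumPy sketch (all function names are illustrative, not OwLite or AI Hub APIs) of the two basic PTQ calibration methods mentioned above: Min-Max picks the quantization scale from the observed value range in a single pass, while MSE searches for the scale that minimizes reconstruction error. Neither updates the model's weights, which is why harder models may need the more advanced PTQ and QAT algorithms OwLite adds on top.

```python
# Illustrative sketch of Min-Max vs. MSE calibration for symmetric int8
# post-training quantization. Names and settings are made up for clarity.
import numpy as np

def quantize_dequantize(x, scale, num_bits=8):
    """Symmetric fake-quantization: round onto the int grid, map back to float."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def minmax_scale(x, num_bits=8):
    """Min-Max calibration: the scale covers the full observed |x| range."""
    return np.abs(x).max() / (2 ** (num_bits - 1) - 1)

def mse_scale(x, num_bits=8, num_candidates=100):
    """MSE calibration: grid-search scales at or below Min-Max, keep the best."""
    base = minmax_scale(x, num_bits)
    candidates = [base * (1.0 - 0.01 * i) for i in range(num_candidates)]
    errors = [np.mean((x - quantize_dequantize(x, s, num_bits)) ** 2)
              for s in candidates]
    return candidates[int(np.argmin(errors))]

rng = np.random.default_rng(0)
# Heavy-tailed weights: a few outliers stretch the Min-Max range, wasting
# quantization levels on values that rarely occur.
w = rng.standard_t(df=3, size=10_000).astype(np.float32)

err_minmax = np.mean((w - quantize_dequantize(w, minmax_scale(w))) ** 2)
err_mse = np.mean((w - quantize_dequantize(w, mse_scale(w))) ** 2)

# MSE calibration slightly clips the outliers and wins on average error;
# Min-Max is itself among the searched candidates, so MSE can never be worse.
assert err_mse <= err_minmax
```

Both methods only choose a scale; they never adjust the weights themselves. QAT, by contrast, fine-tunes the weights with quantization in the loop, which is how it recovers accuracy that calibration alone cannot.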
The table below shows the results of quantizing an actual Vision Transformer (ViT) Base patch16 model using OwLite. While a PTQ method resulted in an accuracy drop of about 2.8%p, OwLite's QAT significantly reduced this gap to 0.85%p while achieving a 26% improvement in latency. This clearly demonstrates how effective OwLite's advanced quantization technology is in real-world edge device environments.

Conclusion
OwLite's QNN support combines the powerful capabilities of the Qualcomm AI Hub with OwLite's user-friendly advanced optimization technology, enabling developers to push the performance of AI models on Qualcomm edge devices to their limits. It is now possible to easily quantize models without complex setups or specialized knowledge, and to derive optimal results by directly verifying performance on actual hardware.
SqueezeBits will continue to enhance OwLite to help developers overcome the challenges of AI model optimization and create innovative AI services. We’re also proud to have joined the Qualcomm AI Program for Innovators (QAIPI) and look forward to deepening our collaboration with Qualcomm through this initiative.
👉 Experience OwLite and unlock the full potential of your Qualcomm-powered devices!