Accuracy Degradation in AI Compression: Myth or Truth?
The benefits of neural network compression have been made clear (in our previous posts). As advocates of compression technology, we encourage using compression to enhance deep learning models. Yet many skeptics still hesitate to adopt it. They are inclined to believe that, since there is a known tradeoff between accuracy and model size, reducing model size through compression must inevitably come at the expense of accuracy. While there is no denying that a model's accuracy does not always stay precisely intact after compression, research suggests that this preconception deserves to be rethought and looked at in a different light.
To begin with, building the ‘ideal’ deep learning model is much like solving a complex multi-objective optimization (MOO) problem (aka Pareto optimization). It is a multiple-criteria decision-making process in which trade-offs exist between two or more conflicting objectives. In a MOO problem, no single option simultaneously maximizes every objective; that ideal is known as the ‘utopian point,’ and it is generally unattainable. Instead, we can look for the set of options (the Pareto-optimal set) that offers the best tradeoffs between competing objectives. That is to say, at a Pareto-optimal point, one objective cannot be improved without making another objective worse.
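For intuition, here is a minimal Python sketch of how Pareto dominance can be checked among candidate model configurations, scored here on accuracy (higher is better) and latency (lower is better). The candidate names and numbers are hypothetical, purely for illustration:

```python
# Minimal sketch: filtering candidate models down to the Pareto-optimal set.
# Candidates and their scores are hypothetical, for illustration only.

def dominates(a, b):
    """a dominates b if a is at least as good on every objective
    and strictly better on at least one (higher accuracy, lower latency)."""
    return (a["accuracy"] >= b["accuracy"] and a["latency_ms"] <= b["latency_ms"]
            and (a["accuracy"] > b["accuracy"] or a["latency_ms"] < b["latency_ms"]))

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

candidates = [
    {"name": "fp16-large", "accuracy": 0.82, "latency_ms": 40.0},
    {"name": "int8-large", "accuracy": 0.81, "latency_ms": 22.0},
    {"name": "fp16-small", "accuracy": 0.78, "latency_ms": 18.0},
    {"name": "int8-small", "accuracy": 0.74, "latency_ms": 25.0},  # dominated by int8-large
]

for c in pareto_front(candidates):
    print(c["name"], c["accuracy"], c["latency_ms"])
```

Every configuration that survives this filter represents a different, equally defensible tradeoff; the ones removed are strictly worse choices.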
With this in mind, ML engineers aim to find a Pareto-efficient point on a multi-dimensional Pareto frontier. Typical axes of this frontier include performance quality (accuracy), computational cost (efficiency), memory footprint, and latency (computational speed). The challenge of simultaneously achieving high accuracy, efficiency, and speed has already been heavily discussed, often labeled ‘GenAI’s Trilemma.’
Despite the conventional belief that we must compromise among these features, it is vital to understand that compression does not merely move us along the existing Pareto frontier. Rather than gaining one objective at the expense of another, evolving compression technology enables a Pareto improvement: multiple objectives improve at once, shifting the frontier outward, closer to the utopian point.
To illustrate, an experiment from ‘Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers’ (Li et al., 2020) compares the accuracy of differently sized RoBERTa models after compression. When reduced to a similar memory footprint, the originally larger models lose less accuracy. Because large models are more robust to both quantization and pruning, they can be compressed heavily without considerably hurting accuracy. The Pareto-optimal strategy in this setting is therefore to increase the model size and apply heavy compression, rather than to start with a small model.
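As a rough sketch of this ‘train large, then compress’ recipe, the snippet below compares the serialized size of a small float32 model against a larger model whose linear layers are dynamically quantized to int8 with PyTorch. The plain stack of linear layers and the chosen widths are illustrative stand-ins, not the RoBERTa checkpoints or the exact compression methods used in the paper:

```python
# Rough sketch of "train large, then compress": a stack of linear layers stands
# in for the transformer models in the paper, and PyTorch dynamic int8
# quantization stands in for its compression step. Widths/depths are illustrative.
import io
import torch
import torch.nn as nn

def make_model(width, depth):
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.ReLU()]
    return nn.Sequential(*layers)

def size_mb(model):
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

small = make_model(width=512, depth=8)    # the "small" baseline
large = make_model(width=1024, depth=8)   # "train large ..."

# "... then compress": quantize the large model's linear weights to int8.
large_int8 = torch.ao.quantization.quantize_dynamic(
    large, {nn.Linear}, dtype=torch.qint8)

print(f"small, fp32: {size_mb(small):5.1f} MB")
print(f"large, fp32: {size_mb(large):5.1f} MB")
print(f"large, int8: {size_mb(large_int8):5.1f} MB")  # roughly matches small-fp32
```

The point of the comparison is that the compressed large model ends up with a footprint comparable to the small float32 model, which is exactly the regime where the paper reports the larger starting point retaining more accuracy.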
In another experiment, from ‘Pareto-Optimal Quantized ResNet Is Mostly 4-bit’ (Abdolrashidi et al., 2021), the authors study the impact of different quantization precisions on a bfloat16 ResNet-50 baseline. The experiment reveals that the 4-bit and 8-bit models, though more heavily quantized than the baseline, outperform the bfloat16 model in terms of accuracy. More precisely, the best compute cost versus accuracy Pareto curve is obtained by 4-bit models with the first and last layers quantized to 8 bits (the purple curve in the paper’s figure). This result contradicts the traditional assumption that more compression simply means forfeiting more accuracy. In this context, and in many others, the more heavily quantized model is often the more advantageous and optimal choice, as it achieves lower compute cost with a negligible effect on accuracy.
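To make the mixed-precision idea concrete, here is a simplified sketch that fake-quantizes the weights of a torchvision ResNet-50 to 4 bits while keeping the first and last layers at 8 bits. It uses plain symmetric per-tensor rounding and only illustrates the layer-wise precision assignment; it is not the quantization-aware training setup of the paper, which also quantizes activations:

```python
# Simplified illustration: 4-bit weight fake quantization for most conv/linear
# layers, 8-bit for the first and last layers. Post-hoc rounding only, not the
# quantization-aware training recipe used in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

def fake_quantize_(weight, bits):
    """Round a weight tensor in place onto a symmetric, per-tensor grid."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    scale = weight.abs().max() / qmax
    weight.copy_((weight / scale).round().clamp(-qmax, qmax) * scale)

model = resnet50()  # randomly initialized torchvision ResNet-50
quantizable = [(name, m) for name, m in model.named_modules()
               if isinstance(m, (nn.Conv2d, nn.Linear))]

with torch.no_grad():
    for i, (name, module) in enumerate(quantizable):
        # keep the first and last layers at 8-bit, everything else at 4-bit
        bits = 8 if i in (0, len(quantizable) - 1) else 4
        fake_quantize_(module.weight, bits)

print(f"{len(quantizable)} layers quantized "
      f"(first: {quantizable[0][0]} @ 8-bit, last: {quantizable[-1][0]} @ 8-bit)")
```

Keeping the boundary layers at higher precision is a common heuristic because they tend to be the most sensitive to quantization error, which matches the precision assignment that the paper finds Pareto-optimal.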
Consequently, concluding that compression unquestionably sacrifices accuracy in all circumstances is a misguided, erroneous assumption. Finding the optimal model requires a multifaceted view of all configuration choices, including compression methods, along with empirical comparison of multiple compressed models (Hohman et al., 2023). A more innovative, attentive, and analytical approach is truly required in model optimization, something we at SqueezeBits have considerable experience and expertise in. We provide guidance and cutting-edge solutions to optimize your models through compression. Reach out to our professional AI model compression services. Contact us at info@squeezebits.com!
References
[1] Multi-objective optimization. (2024, January 18). In Wikipedia. https://en.wikipedia.org/wiki/Multi-objective_optimization
[2] Pareto efficiency. (2024, March 11). In Wikipedia. https://en.wikipedia.org/wiki/Pareto_efficiency
[3] NEMO: A Novel Multi-Objective Optimization Method for AI Challenges. (2021, July 15). https://www.linkedin.com/pulse/nemo-novel-multi-objective-optimization-method-ai-challenges-miret/
[4] Conquering GenAI’s Trilemma: Performance, Efficiency and Speed. (2024, March 25). https://ai.plainenglish.io/conquering-genais-trilemma-performance-efficiency-and-speed-ee159f05f111
[5] Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. (2020, June 23). https://arxiv.org/abs/2002.11794
[6] Pareto-Optimal Quantized ResNet Is Mostly 4-bit. (2021, May 7). https://arxiv.org/abs/2105.03536
[7] Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences. (2023, October 6). https://arxiv.org/abs/2310.04621