Are you getting everything out of your GPUs?
At the 2024 GTC event, Nvidia CEO Jensen Huang took the stage to deliver his keynote, in which he unveiled the newest GPU generation, ‘Blackwell.’ He claimed that in the 8 years leading up to Blackwell, advances in the GPU chip delivered a 1000x increase in AI computation. He presented a visual timeline of that exponential growth, running from the 2016 Pascal generation through 2017’s Volta, 2020’s Ampere, and 2022’s Hopper to the 2024 Blackwell generation.
Bill Dally, Nvidia’s chief scientist, has previously backed Huang’s assertion. At the 2023 Hot Chips event, Dally highlighted the 1000x growth in single-chip inference performance over the preceding decade, from Kepler (K20X) to Hopper (H100). Nvidia’s recurring claim that it has moved past Moore’s law (transistor density, and with it computing capability, doubling at a steady rate) to ‘Huang’s law’ (GPU performance growing far faster than Moore’s law would predict) thus seems to be hardening into accepted fact.
Yet a 1000x improvement (with emphasis on the “1000”) is a considerable, even startling number, one that almost sounds like an exaggeration. It is hard to believe that a single chip could make technological strides of that scale in a span of just 8 to 10 years. It is only natural for skeptics to wonder: ‘Where does the 1000x improvement come from?’ ‘Am I actually exploiting my GPU’s 1000x capability to the fullest?’ ‘How can my Nvidia GPU deliver the improvements Huang claims?’
Dally explains that the 1000x gain in inference performance comes from four sources: two in hardware and two in software.
First, improvements in complex instructions. Moving from the FMA instruction (Fused Multiply-Add) of the Tesla architecture (2006) to the IMMA instruction (Integer Matrix Multiply-Accumulate) of the Ampere architecture (2020) allows a far larger computation to be performed by a single complex instruction, amortizing instruction fetch and scheduling overhead over many more arithmetic operations. This advancement accounts for a 12.5x increase in chip performance.
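To see why this matters, consider a toy comparison, a minimal NumPy sketch where library calls stand in for GPU instructions (this illustrates the principle, not real instruction semantics): a scalar FMA performs one multiply-add per instruction, while an IMMA-style operation folds an entire small integer matrix multiply-accumulate into a single operation.

```python
import numpy as np

def scalar_fma_matmul(A, B, C):
    """Compute D = A @ B + C one fused multiply-add at a time,
    the way a chain of scalar FMA instructions would."""
    D = C.astype(np.int32).copy()
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                D[i, j] += int(A[i, k]) * int(B[k, j])  # one FMA
    return D

def imma_style_matmul(A, B, C):
    """One 'complex instruction': a whole 8x8 int8 matrix
    multiply-accumulate (D = A @ B + C) in a single call."""
    return A.astype(np.int32) @ B.astype(np.int32) + C.astype(np.int32)

A = np.random.randint(-128, 128, (8, 8), dtype=np.int8)
B = np.random.randint(-128, 128, (8, 8), dtype=np.int8)
C = np.zeros((8, 8), dtype=np.int32)

# 512 scalar multiply-adds vs. one matrix operation, same result
assert np.array_equal(scalar_fma_matmul(A, B, C), imma_style_matmul(A, B, C))
```

The matrix version spreads the per-instruction overhead across 512 multiply-adds instead of one, which is the essence of the complex-instruction gain.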
Second, advancements in process technology. Shrinking from the 28nm process of the Kepler K20X down to the 5nm process of the H100 yielded a 2.5x performance gain. Now that Blackwell uses TSMC’s new 4NP node (an enhanced version of its 4nm process), this factor will improve a step further.
Multiplying the gains from complex instructions and process technology (12.5 × 2.5 ≈ 31), the hardware enhancements account for only about 31x of the total improvement. Where does the rest of the 1000x come from? Dally attributes it to number representation and sparsity, triumphs of software development.
The third gain, number representation, contributes a 16x improvement in inference performance, the largest single factor. By representing the key parameters of a neural network in a lower-precision format, computations are accelerated while accuracy is largely retained. Before the Pascal architecture, computations used 32-bit single-precision floating point (FP32). The Pascal GPU (P100) halved precision to 16 bits (FP16), and Hopper cut it down to 8 bits (the FP8 format). Now, the new Blackwell will support FP4 and FP6. Introducing the new FP4 format for Blackwell at the 2024 GTC, Huang added, “The way we compute is fundamentally different.”
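As a concrete, if simplified, illustration of the principle, the sketch below performs symmetric 8-bit integer quantization in NumPy. This is a generic example, not Nvidia’s FP16/FP8/FP4 formats: weights are mapped onto a small integer grid with a single scale factor, and dequantization recovers a close approximation of the original values.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: FP32 weights -> int8 codes + scale."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to the scale
```

Each weight now occupies 8 bits instead of 32, so the same memory bandwidth and arithmetic units process four times as many parameters, which is where the speedup comes from.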
The last remaining contribution comes from network sparsity. ‘Structured sparsity,’ introduced with the Ampere generation (A100), prunes away weights that contribute little to the output: in every group of four weights, the two with the smallest magnitude are zeroed out (a 2:4 pattern), which can yield up to a 2x improvement.
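The sketch below shows what 2:4 structured pruning looks like in NumPy, using the common magnitude criterion. It is an illustration of the sparsity pattern only, not Nvidia’s actual sparse tensor core implementation, which also stores metadata so the zeros can be skipped in hardware.

```python
import numpy as np

def prune_2_of_4(w):
    """2:4 structured sparsity: zero the two smallest-magnitude weights
    in every contiguous group of four (simple magnitude criterion)."""
    flat = w.reshape(-1, 4).copy()
    # indices of the two smallest |w| in each group of four
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(w.shape)

w = np.random.randn(2, 8).astype(np.float32)
w_sparse = prune_2_of_4(w)
# exactly two zeros in every group of four
assert ((w_sparse.reshape(-1, 4) == 0).sum(axis=1) == 2).all()
```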
All in all, the four gains, from complex instructions (12.5x), process technology (2.5x), number representation (16x), and network sparsity (2x), multiply out to a 1000x boost in single-chip inference performance: 12.5 × 2.5 × 16 × 2 = 1000. But more importantly, we must recognize that the rise of accelerated computing for AI cannot be attributed solely to the ascendancy of hardware (the GPU). The progress requires the collective contributions of both hardware and software, interdependent and intertwined with one another. As Nvidia describes it, researchers must design hardware and software in tandem; it is a ‘full-stack innovation.’ Huang said it himself: “The innovation isn’t just about chips. It’s about the entire stack.”
Thus, when Nvidia describes its feat of a decade of accelerated computing, the progress has never been the accomplishment of hardware alone; it has always been contingent on software advances, namely quantization and sparsity. The advertised ‘1000x’ improvement largely assumes that these compression technologies are actually applied on top of the hardware enhancements.
Going forward, if your objective is to maximize AI inference performance on Nvidia GPUs, implementing software improvements will be crucial. Only by employing the compression techniques of quantization and pruning will you get the most out of your Nvidia GPU. If you are a critical consumer, don’t take ‘1000x’ at face value: probe and dissect the origins of that number, and make sure you’re getting your money’s worth from your chip by compressing your AI model.
We, SqueezeBits, are in the business of making sure you’re getting everything out of your GPU. Interested in saving costs by getting the most out of your GPUs? Contact us at info@squeezebits.com.
For more information on compression and SqueezeBits, please visit: