[Intel Gaudi] #4. FP8 Quantization
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Jan 13, 2025
Introduction
Gaudi, Intel's AI accelerator series, is designed to optimize deep learning workloads with efficiency and scalability. In our previous posts, we explored the overall concepts, architecture, design, and performance of Gaudi-2 hardware and its software, along with a comparative analysis against the A100 GPU. In this post, we will focus on one of Gaudi-2's key advantages over the A100: support for the FP8 format.
FP8 is an emerging numerical format that reduces memory usage and computational requirements while maintaining sufficient precision for most AI workloads. The most common FP8 formats are E4M3 and E5M2, both with 1 sign bit, as illustrated in Figure 1. E4M3 dedicates an additional bit to the mantissa compared to E5M2, enabling finer precision, while E5M2 extends the exponent range to accommodate larger numerical values. Table 1 shows examples of how E4M3 FP8 values align with the IEEE-754 standard.
Gaudi-2 hardware natively supports both the E4M3 and E5M2 formats. The E4M3 format supports a configurable exponent bias of 3, 7, 11, or 15, with 7 as the default, enabling a range from -240 to 240 with IEEE-754-style encoding. In comparison, the E5M2 format uses a default exponent bias of 15 and covers a wider range. Gaudi-3 hardware enhances these capabilities with finer and broader exponent-bias options, offering an even wider range for E4M3.
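To make the exponent-bias trade-off concrete, below is a minimal, illustrative decoder for E4M3 bytes using plain IEEE-754-style rules. It is a sketch only: Gaudi hardware reserves certain encodings for special values, which is why its documented E4M3 range at bias 7 (±240) is narrower than this naive decoding suggests.

# Illustrative E4M3 decoding with a configurable exponent bias.
# Plain IEEE-754-style rules; hardware handling of NaN/inf encodings may differ.
def decode_e4m3(byte: int, bias: int = 7) -> float:
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 3) & 0xF   # 4 exponent bits
    mantissa = byte & 0x7          # 3 mantissa bits
    if exponent == 0:              # subnormal: no implicit leading 1
        return sign * (mantissa / 8) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - bias)

# 0x7E has exponent bits 1111 and mantissa bits 110.
print(decode_e4m3(0x7E))           # 448.0 with the default bias of 7
print(decode_e4m3(0x7E, bias=11))  # 28.0 -- a larger bias trades range for fine precision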
Notably, FP8 support is a key advantage of Gaudi-2 over the A100 GPU, which lacks native FP8 capability. As summarized in Table 2, Gaudi-2 achieves a peak of 865 TFLOPS in FP8, compared to 432 TFLOPS in BF16 on Gaudi-2 and 312 TFLOPS in BF16 on the A100.
How to Implement FP8 Quantization on Gaudi-2
Quantization on the Gaudi series leverages Intel Neural Compressor (INC), an open-source library for model compression developed by Intel. INC provides implementations of calibration and quantization methods that can be easily applied to existing models. It supports a wide range of popular compression techniques, including quantization, pruning, and neural architecture search, and is compatible with mainstream frameworks such as PyTorch, TensorFlow, and ONNX Runtime. As it is developed primarily by Intel, it is highly optimized for Intel hardware, including the Gaudi series.
INC supports a variety of quantization methodologies, such as Weight-Only Quantization, FP8 Quantization, MX Quantization, Smooth Quantization, and many others. For FP8 quantization, Post-Training Quantization (PTQ) is supported, and it can be applied to any PyTorch model by replacing each module (operation) with a counterpart that contains a modified forward method or hooks for calibration or quantization. This can be achieved with a few lines of code, as illustrated in Code 1.
import os

from neural_compressor.torch.quantization import (
    FP8Config, convert, prepare, finalize_calibration
)

# Load the FP8 config (MEASURE or QUANTIZE mode) from the file
# pointed to by the QUANT_CONFIG environment variable.
config = FP8Config.from_json_file(os.getenv("QUANT_CONFIG", ""))

if config.measure:
    # MEASURE: patch target modules with calibration hooks.
    model = prepare(model, config)
elif config.quantize:
    # QUANTIZE: replace target modules with FP8 counterparts.
    model = convert(model, config)

output = model(model_inputs)

if config.measure:
    # Dump the collected statistics for later use in the QUANTIZE step.
    finalize_calibration(model)
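In a typical workflow, the same script is run twice with the QUANT_CONFIG environment variable pointing to different config files: first a MEASURE-mode config to collect statistics, then a QUANTIZE-mode config (such as Code 2 below) that reads those statistics from the shared dump_stats_path and produces the quantized model.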
Two Steps for FP8 Quantization: MEASURE and QUANTIZE
Post-Training Quantization is typically applied by analyzing the statistics of tensors within a model and then generating a quantized model using quantization bins based on those statistics. Using INC, this process involves two key steps: MEASURE and QUANTIZE.
1. MEASURE Step (Calibration)
During the MEASURE step, tensor statistics are collected with a small set of data called the calibration dataset. The prepare function configures the model for calibration by replacing target modules with their corresponding patched versions defined in INC, which include forward hooks to collect the statistics. With the patched modules in place, the statistics are gathered during the model's forward pass and stored in a file via finalize_calibration for later use in the QUANTIZE step. It is important to note that the modules to be quantized must be included in the patch target and have corresponding patched modules with the appropriate measure and forward methods defined. For example, in vLLM, the KV cache is defined in a class named VLLMKVCache, designed to work with the paged attention mechanism. To quantize this module properly, a corresponding PatchedVLLMKVCache must be defined to match its implementation. If a module is not natively supported by INC, users can define and use a custom patched module to enable quantization.
2. QUANTIZE Step (Quantization)
In the QUANTIZE step, the saved calibration statistics are loaded to calculate optimal quantization scale values. INC offers various scaling methods, enabling users to select an appropriate one by weighing performance against accuracy. The patched modules for the QUANTIZE step quantize input tensors and invoke the low-precision kernels. While the additional quantize and dequantize operations may seem to introduce type-casting overhead, SynapseAI's graph compiler eliminates unnecessary consecutive QDQ operations whenever possible, ensuring that activation tensors remain in low precision throughout the workflow. SynapseAI provides a feature to visualize intermediate graphs during compilation; using this feature, Figure 2 illustrates the graphs before and after QDQ elimination.
To achieve competitive accuracy, selecting an appropriate quantization scheme is essential. For additional reliability, INC offers a parameter called the backoff factor, which adjusts the scales by dividing them to better handle outliers that exceed the observed calibration range. The default backoff values are 0.25 for inputs and 0.5 for weights, providing a strong baseline for most workloads.
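As a rough illustration of how a maxabs-style scale and a backoff factor interact, consider the sketch below. It is only a conceptual example under simplified assumptions (INC's exact formulas and rounding behavior may differ); the point is that shrinking the scale leaves headroom above the calibrated maximum.

import torch

FP8_E4M3_MAX = 240.0  # Gaudi-2 E4M3 full scale at the default bias

def maxabs_scale(calib_maxabs: float, backoff: float = 0.25) -> float:
    # Without backoff, the calibrated max would map exactly to full scale;
    # multiplying by backoff < 1 (i.e., dividing the scale) leaves headroom
    # for outliers beyond the calibration range.
    return (FP8_E4M3_MAX / calib_maxabs) * backoff

def quant_dequant(x: torch.Tensor, scale: float) -> torch.Tensor:
    x_q = torch.clamp(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # quantize (FP8 rounding omitted)
    return x_q / scale                                          # dequantize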
All of these configurations, including the quantization scheme and backoff factors, can be set through a config file, which is used to initialize the FP8Config instance. Quantization targets can also be controlled with a whitelist and a blacklist. Code 2 provides an example of a config file in QUANTIZE mode that quantizes all possible modules except VLLMKVCache, using the maxabs_hw scaling method.

{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "whitelist": {"types": [], "names": []},
    "blacklist": {"types": ["VLLMKVCache"], "names": []},
    "dump_stats_path": "./artifacts/artifact"
}
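For reference, the config used in the earlier MEASURE step follows the same structure. A minimal sketch might look like the following (field values are illustrative and should be checked against the INC documentation):

{
    "mode": "MEASURE",
    "observer": "maxabs",
    "dump_stats_path": "./artifacts/artifact"
}

Both configs should point to the same dump_stats_path so that the QUANTIZE step can locate the statistics collected during calibration.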
Experiment Setup
We conducted experiments to evaluate the performance of FP8 on Gaudi-2 compared to BF16 on both Gaudi-2 and A100. Additionally, we assessed the impact of KV cache quantization by leaving the KV cache in BF16 format, and analyzed the performance of the different scaling methods supported in INC. The experiments were conducted using vLLM, which officially supports the HPU (Gaudi) backend.
Hardware and Software Setup
- Hardware: Intel Gaudi-2, NVIDIA A100-SXM4 80GB
- Gaudi SDK: SynapseAI v1.19.0
- INC: 3.1.dev61+gb1964323c7
- Framework: vLLM v0.6.4
- Model: Llama-3.1-8B-Instruct
- Dataset: Fixed random dataset
Gaudi has been supported in vLLM since v0.6.4. However, we used a forked repository maintained by Intel to leverage the latest updates for Gaudi. To ensure a fair performance comparison, we conducted experiments using randomly generated fixed-length datasets with input lengths of 1K and 4K tokens, while keeping the output length fixed at 1K tokens by ignoring the EOS token. For accuracy evaluation, we conducted additional experiments using lm-eval-harness. The maximum batch size was set to 256, and concurrency was varied from 32 to 256 in increments of 32 to analyze metrics across different batch sizes.
Results
Gaudi-2 FP8 vs. Gaudi-2 BF16 vs. A100 BF16
As observed in the previous post, Gaudi-2 has generally outperformed the A100 in BF16 since the SynapseAI 1.19 update. In this experiment, we demonstrate that Gaudi-2 achieves further performance improvements with FP8 quantization. The maxabs_hw scaling method was used throughout these experiments.
Figure 3 presents the throughput and TPOT results for the 1K and 4K datasets, showing significant performance gains with FP8 over BF16 across all configurations. For the 1K dataset, where BF16 also has sufficient memory to reach the same running batch size as FP8, FP8 delivers higher throughput and lower TPOT at every batch size. FP8 computation accelerates compute-bound operations, while performance in the memory-bound regime benefits from halving the memory footprint of the model weights and the KV cache.
For the 4K dataset, where the running batch size of BF16 is limited by memory constraints, FP8's advantages become even more evident. As shown in Figure 4, at a concurrency level of 256 with 4K input length, BF16 reaches an average running batch size of 91, whereas FP8 can reach up to 144. Consequently, while BF16 throughput saturates beyond a concurrency of 96, FP8 continues to scale with larger batch sizes, delivering higher throughput. Because FP8 runs these larger batches, its TPOT becomes higher than that of BF16 beyond BF16's saturation point.
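A rough back-of-the-envelope calculation shows why the running batch sizes diverge. Assuming Llama-3.1-8B's standard configuration (32 layers, 8 KV heads, head dimension 128) and ignoring weights, activations, and allocator overhead, the KV cache alone scales as follows:

# Rough KV-cache sizing sketch for Llama-3.1-8B (32 layers, 8 KV heads,
# head_dim 128 assumed); real memory use also includes weights and overhead.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token_bf16 = 2 * layers * kv_heads * head_dim * 2  # K and V, 2 bytes each
bytes_per_token_fp8  = 2 * layers * kv_heads * head_dim * 1  # K and V, 1 byte each

seq_len = 4096 + 1024   # 4K input + 1K output tokens
batch = 144
gib = 1024 ** 3
print(bytes_per_token_bf16 * seq_len * batch / gib)  # ~90 GiB in BF16
print(bytes_per_token_fp8  * seq_len * batch / gib)  # ~45 GiB in FP8

With the model weights on top, a BF16 KV cache for 144 concurrent 5K-token sequences would not fit in Gaudi-2's 96 GB of HBM, which is consistent with BF16 plateauing around a running batch size of 91 while FP8 keeps scaling.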
We also measured prefill throughput, a compute-bound workload, to isolate and estimate the performance gain from FP8 computation alone, excluding benefits from reduced memory overhead. Across all datasets, FP8 demonstrates significant improvements over BF16, highlighting the superior speed of FP8 operations compared to BF16. However, unlike the A100, Gaudi-2 shows a tendency for prefill throughput to decrease as input length increases. This is mostly due to the limitation of PagedAttention in vLLM for Gaudi-2 and remains future work for further optimization.
Impact of KV Cache Quantization
The experiment above was conducted by quantizing all possible modules. However, in some cases accuracy may degrade after quantization, requiring certain modules to be excluded from quantization and kept in high precision. INC provides a convenient way to handle this through the blacklist in the config file, which can target either individual module names or a type, covering all modules of that type. Code 2 illustrates an example where all VLLMKVCache modules are excluded from quantization using the type-based specification. In this experiment, we used this feature to keep the KV cache in BF16 and evaluate its impact on performance. As shown in Figure 6, whether the KV cache is quantized or not significantly affects performance. When the KV cache remains in BF16, the memory bandwidth requirement increases, which degrades performance in memory-bound scenarios. Additionally, since the matmul operations are still quantized, this setup introduces extra casting overhead, further degrading performance. Figure 7 shows that prefill performance, which is not critically affected by the KV cache, is comparable to full FP8 quantization, demonstrating that operations like matmul are still executed in FP8.
Scaling Methods
INC supports various scaling methods for FP8 quantization, including:
- maxabs_hw: Stretches/compresses the maxabs measurement to the full scale of FP8 and then replaces it with an appropriate HW-accelerated scale.
- maxabs_pow2: Stretches/compresses the maxabs measurement to the full scale of FP8 and then rounds the scale to a power of 2.
- act_maxabs_hw_weights_pcs_maxabs_pow2: Calculates weight scales per channel, rounded to a power of 2, while using maxabs_hw for activations.
- act_maxabs_pow2_weights_pcs_opt_pow2: Calculates weight scales per channel, searching for an optimal power-of-2 scale, and uses maxabs_pow2 for activations.
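Switching methods only requires changing the scale_method field in the config file from Code 2; for example (whitelist/blacklist fields omitted here for brevity):

{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "act_maxabs_hw_weights_pcs_maxabs_pow2",
    "dump_stats_path": "./artifacts/artifact"
}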
More methods and detailed explanations can be found in the Gaudi Documentation. maxabs_hw was used in the previous experiments; in this experiment, we evaluated the performance of the different scaling methods. For these experiments, Softmax modules were excluded from quantization due to errors that occurred with some scaling methods; this is expected to be resolved as INC continues to be updated. Overall, as illustrated in Figure 8, there were no significant performance differences between the methods. However, HW-accelerated scales were generally faster than power-of-2 scales, and per-channel scaling for weights showed a slight performance advantage over per-tensor scaling for both activations and weights.
While our previous experiments primarily focused on performance metrics such as speed, here we shifted our attention to accuracy. Using lm-eval-harness, we tested the model's capabilities across various domains through tasks such as MMLU, GSM-8K, Winogrande, and TruthfulQA. For all scaling methods, the Ultrachat dataset was used for calibration, and results were measured with 5-shot prompting across all tasks.
As shown in Table 3, all scaling methods exhibited slight accuracy degradation compared to BF16, with GSM-8K showing a more noticeable degradation. Generally, scaling methods using power-of-2 scales showed higher accuracy than those using HW-accelerated scales. While per-channel quantization for weights was theoretically expected to perform better than per-tensor quantization, the experimental results revealed only minor accuracy differences between the two, with no clear patterns or consistent trends. The achieved accuracy is considered competitive, especially given that all possible modules, except for softmax operations, were quantized and no additional optimizations, such as backoff factors, were applied. With INC actively evolving, more advanced algorithms are likely to be introduced, and further accuracy improvements are expected on Gaudi-3, which extends the range of the E4M3 format.
Conclusion
In this post, we explored the strengths of FP8 on Gaudi-2, highlighting its ability to significantly enhance speed with minimal accuracy degradation. In terms of ease of use, FP8 quantization can be implemented straightforwardly with INC, which is fully open source. However, there are also limitations to the FP8 feature, the most significant being longer graph compilation time compared to BF16. As noted in our previous post, on-the-fly graph compilations became a critical issue after the v1.19 update, making it essential to carefully configure the warm-up process to prevent them. This issue is even more pronounced with FP8, as on-the-fly compilations can in some cases result in performance lower than BF16. Another caveat is INC's patching mechanism, which requires patched modules to be pre-defined for proper quantization; it is important to verify that all target modules are properly patched, especially when using custom models.
If you’re interested in evaluating LLM performance on various hardware such as the Gaudi series, take a look at our LLM serving benchmark tool, Fits on Chips! It enables no-code yet in-depth LLM benchmarking across different frameworks, models, and datasets, and lets you adjust settings and observe their impact on performance metrics. We’re also actively working to add support for vLLM on both Gaudi-2 and Gaudi-3, further enhancing the ability to compare devices and frameworks. Discover more about Fits on Chips here:
Stay tuned for more insights into the LLM serving capabilities of Intel Gaudi Series!