Disclaimer
This blog series is being written independently, without any input or influence from Intel. Our objective is to provide an unbiased evaluation by discussing both the strengths and the limitations of the Gaudi-2 accelerator as a third-party user. Really.
Introduction
In our previous post, we discussed the details of the graph compiler, a cornerstone of the software stack for the Intel Gaudi series. For those unfamiliar, Intel Gaudi is a family of accelerators designed specifically for AI workloads. These accelerators aim to provide a cost-effective solution for both AI model inference and training, characterized by their high memory capacity and efficient bandwidth utilization. To fully leverage these capabilities, the graph compiler plays a pivotal role in the software stack, orchestrating the compilation and optimization of computation graphs.
As we moved on to exploring the performance disparity between the NVIDIA A100 and Intel Gaudi-2 in inference tasks, we identified a key challenge: the underwhelming performance of Gaudi-2, stemming largely from the design of its attention kernel. However, with the release of SynapseAI version 1.19 and critical optimizations to vLLM for Gaudi, we are now seeing strong performance from Gaudi-2. In this post, we present the updated performance metrics for Gaudi-2 in large language model (LLM) inference and discuss the enhancements that contributed to this significant improvement.
LLM Inference on Intel Gaudi
As outlined in our previous post, the Intel Gaudi series relies heavily on the graph compiler for pre-compiling computation graphs. This approach enables graph-level optimizations for improved efficiency and eliminates the need for a costly address generation unit in the architecture. However, it imposes a stringent requirement: computation graphs must be pre-compiled on the host system before any processing begins. This requirement poses a challenge for LLM inference, given its inherently auto-regressive nature, which leads to continuously varying input and output tensor shapes. Addressing this variability relies on bucketing, a mechanism we discussed in detail in our last post.
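To make bucketing concrete, here is a minimal sketch of how request shapes could be rounded up to a small set of pre-compiled graphs; the bucket ranges, step sizes, and helper names below are illustrative and are not the defaults of the Gaudi vLLM fork.

```python
# Minimal sketch of shape bucketing: every (batch_size, seq_len) pair is rounded
# up to the nearest pre-compiled bucket, so the graph compiler only ever sees a
# small, fixed set of tensor shapes. Bucket ranges here are illustrative.
def make_buckets(minimum: int, step: int, maximum: int) -> list[int]:
    """Candidate bucket sizes: minimum, minimum+step, ..., up to maximum."""
    return list(range(minimum, maximum + 1, step))

def round_up_to_bucket(value: int, buckets: list[int]) -> int:
    """Pick the smallest pre-compiled bucket that fits the actual value."""
    for b in buckets:
        if value <= b:
            return b
    raise ValueError(f"{value} exceeds the largest bucket {buckets[-1]}; "
                     "this would trigger an on-the-fly graph compilation")

seq_buckets = make_buckets(minimum=128, step=128, maximum=2048)
batch_buckets = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# A batch of 3 requests with 300 prompt tokens runs on the (4, 384) graph;
# the padded positions are masked out inside the kernels.
print(round_up_to_bucket(3, batch_buckets), round_up_to_bucket(300, seq_buckets))
```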
Beyond tensor shape dynamicity, another critical challenge in graph-compiler-based LLM inference is optimizing the attention kernel. In PagedAttention, key-value (KV) caches are managed at the page level, with a fixed number of tokens per page. Consequently, KV caches for a request can be non-contiguous in memory, making efficient access to these caches a significant hurdle.
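To see where the non-contiguity comes from, below is a toy sketch of page-level KV-cache bookkeeping in the spirit of PagedAttention; the page size, pool size, and data structures are simplified illustrations, not vLLM’s actual implementation.

```python
# Toy sketch of page-level KV-cache bookkeeping (not vLLM's internals).
# Each request owns a list of physical page indices; logically consecutive
# tokens can therefore end up in physically scattered pages.
PAGE_SIZE = 128                          # tokens per KV-cache page (illustrative)
free_pages = list(range(1024))           # pool of physical page indices
block_tables: dict[int, list[int]] = {}  # request id -> physical pages, in logical order
token_counts: dict[int, int] = {}        # request id -> tokens cached so far

def append_token(request_id: int) -> None:
    """Reserve KV-cache space for one more token of a request."""
    pages = block_tables.setdefault(request_id, [])
    count = token_counts.get(request_id, 0)
    if count == len(pages) * PAGE_SIZE:   # current pages are full
        pages.append(free_pages.pop())    # grab any free physical page
    token_counts[request_id] = count + 1

# Two requests growing in an interleaved fashion end up with interleaved,
# i.e. non-contiguous, physical pages: reading one request's whole KV cache
# then requires a gather over its block table rather than one linear read.
for _ in range(2 * PAGE_SIZE):
    append_token(0)
    append_token(1)
print(block_tables)   # e.g. {0: [1023, 1021], 1: [1022, 1020]}
```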
Recent SynapseAI and vLLM releases for Intel Gaudi introduced a feature called Contiguous PagedAttention (Contiguous PA) to address these challenges. This feature leverages the design of Gaudi’s attention kernel, flatPA, which flattens the KV cache before attention computation. flatPA collapses the batch dimension, at the small cost of a few additional matrix multiplications, to improve the overall efficiency of the attention kernel. To enable this reduction, the indices of each KV-cache page are pre-determined before computation; using these indices, the attention results can be reshaped back into the batch dimension afterwards. Contiguous PA builds on this ability to reshape results back into the batch dimension: by ensuring that the KV cache is read contiguously through the pre-determined page indices, it achieves more efficient memory access and processing. As illustrated in Figure 1, Contiguous PA replaces several KV-cache gather operations with faster slice operations, significantly improving computation throughput.
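The gather-to-slice intuition can be sketched in a few lines of PyTorch. This is only a conceptual illustration of the memory-access pattern, with made-up cache shapes; it is not the actual flatPA or Contiguous PA kernel.

```python
# Conceptual illustration (not the real kernels): when a request's pages are
# scattered, reading its KV cache is a gather (index_select); when they are
# allocated contiguously, the same read becomes a slice.
import torch

num_pages, page_size, num_heads, head_dim = 64, 128, 8, 128
kv_cache = torch.randn(num_pages, page_size, num_heads, head_dim)

# Scattered pages -> gather over the block table (extra indexing and copies).
scattered_pages = torch.tensor([57, 12, 3, 41])
kv_gathered = kv_cache.index_select(0, scattered_pages)

# Contiguously allocated pages -> a single slice, which is just a view
# (no copy) and reads memory linearly.
first_page, num_req_pages = 12, 4
kv_sliced = kv_cache[first_page : first_page + num_req_pages]

print(kv_gathered.shape, kv_sliced.shape)  # both (4, 128, 8, 128)
```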
Alongside Contiguous PA, another optimization—Pipelined PagedAttention (Pipelined PA)—further enhances performance by minimizing computation stalls during the softmax operations in the attention kernel. This feature pipelines scalar operations for softmax with matrix multiplications, effectively hiding the latency of scalar operations. As demonstrated in Figure 1, this approach significantly reduces computation stalls caused by the softmax process.
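As a rough mental model, the sketch below computes single-query attention chunk by chunk with an online softmax, so each chunk alternates between matrix multiplications and scalar softmax work; on the device, these phases can overlap across chunks, which is the effect Pipelined PA exploits. This is our own illustration and runs sequentially in Python; it is not the Gaudi kernel.

```python
# Chunked single-query attention with an online (running) softmax. Comments mark
# which engine would plausibly own each step on the device; in hardware, the
# scalar work of chunk i can overlap with the matrix multiplications of chunk i+1.
import torch

def chunked_attention(q, k, v, chunk=128):
    """q: (d,), k/v: (n, d). Numerically equivalent to full softmax attention."""
    d = q.shape[0]
    acc = torch.zeros(d)                      # running sum of exp(score) * v
    denom = torch.tensor(0.0)                 # running sum of exp(score)
    running_max = torch.tensor(float("-inf"))
    for start in range(0, k.shape[0], chunk):
        k_c, v_c = k[start:start + chunk], v[start:start + chunk]
        scores = k_c @ q                                    # matrix engine: QK^T
        new_max = torch.maximum(running_max, scores.max())  # scalar engine
        scale = torch.exp(running_max - new_max)            # scalar engine
        weights = torch.exp(scores - new_max)               # scalar engine
        acc = acc * scale + weights @ v_c                   # matrix engine: P @ V
        denom = denom * scale + weights.sum()               # scalar engine
        running_max = new_max
    return acc / denom

torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(512, 64), torch.randn(512, 64)
ref = torch.softmax(k @ q, dim=0) @ v
assert torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4)
```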
These advancements enable Gaudi-2 to consistently deliver competitive results against NVIDIA A100. In the following sections, we’ll delve deeper into these performance improvements and evaluate where Gaudi-2 excels, as well as its future potential.
Experiment Setup
To evaluate the inference performance of Gaudi-2, we conducted a series of benchmarks against its market competitor, the NVIDIA A100. Our analysis centered on throughput and time-per-output-token (TPOT) across varying sequence lengths—two critical metrics for measuring inference efficiency. While the experimental framework mirrored that of our previous post, the performance numbers for Gaudi-2 have seen significant improvements.
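For clarity, these are the definitions we use for the two metrics, sketched as simple helpers; they follow common practice, and edge-case handling (for example, how the first token is counted) may differ slightly between benchmark scripts.

```python
# Common metric definitions (sketch): throughput is generated tokens per second
# over the whole run; TPOT is the per-token decode time of a request, i.e. the
# time after the first token divided by the remaining output tokens.
def throughput(total_output_tokens: int, wall_clock_seconds: float) -> float:
    return total_output_tokens / wall_clock_seconds

def tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    return (e2e_latency_s - ttft_s) / max(output_tokens - 1, 1)
```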
Benchmark Dataset
For static analysis, we used a fixed random dataset with static input and output lengths. End-of-sentence tokens were ignored to maintain a consistent output length. For dynamic evaluation, we used a custom-curated dataset, Dynamic Sonnet, specifically designed to test how LLM serving systems handle variable input lengths. In this case, the output length was determined dynamically, with generation ceasing either upon encountering an end-of-sentence token or reaching the maximum limit of 1024 tokens. Each dataset is labeled as nK, where n indicates the maximum input length in multiples of 1024 tokens. All experiments involved 1024 requests, with the maximum batch size capped at 256.
Software and Hardware Setup
- Framework: vLLM >= v0.6.4 (commit a5b7eae for Gaudi-2)
- Gaudi SDK: SynapseAI v1.19.0
- Model: Llama-3.1-8B-Instruct (BF16)
- Hardware: NVIDIA A100-PCIe 80G GPU, Intel Gaudi-2
The Gaudi-2 benchmarks were conducted on the development branch, as several features were still under active development. These features have since been released as v0.6.4.post2+Gaudi-1.19.0, available here.
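For reference, here is a minimal sketch of the kind of offline vLLM setup behind the fixed-length runs, using the public LLM API; the exact keyword arguments can vary between vLLM versions and the Gaudi fork, so treat this as an approximation rather than our exact benchmark harness.

```python
# Minimal sketch (not our exact harness): fixed-length prompts, EOS ignored so
# every request generates exactly 1024 tokens, batch size capped at 256.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_num_seqs=256,          # maximum batch size used in our experiments
)
params = SamplingParams(
    temperature=0.0,
    max_tokens=1024,
    ignore_eos=True,           # keep the output length constant
)
prompts = ["..."] * 1024       # 1024 fixed-length requests (placeholder prompts)
outputs = llm.generate(prompts, params)
```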
Results
Fixed Dataset
Across all fixed datasets, ranging from 1K to 8K input tokens, Gaudi-2 consistently outperformed the NVIDIA A100. The performance advantage was particularly pronounced for shorter input sequences, where we observed a 30–40% gain in throughput. Although this gain diminished as the input length approached 8K, Gaudi-2 still demonstrated comparable efficiency. Notably, under the same concurrency levels, Gaudi-2 consistently achieved shorter TPOT, contributing to its improved throughput. These results indicate that Gaudi-2 has now achieved similar decoding performance to the A100, thanks to enhancements in the PagedAttention kernel.
One notable aspect of these experiments was that achieving this performance required modifications to the default bucket configuration. Specifically, we increased the maximum sequence length for both prefill and decode buckets. This adjustment was necessary to avoid on-the-fly graph compilations, which introduce significant latency overhead. The impact of such compilations was especially evident in this benchmark, where the KV cache was fully utilized for each request, with input sequences of up to nK tokens and 1K output tokens. While this tuning unlocked the performance potential of Gaudi-2, it also highlights a potential challenge: careful configuration is needed to fully leverage Gaudi-2’s advantages over the A100.
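To give a sense of what this tuning looks like in practice, here is a sketch based on environment variables; the variable names are our best recollection of the bucketing knobs documented in the HabanaAI vLLM fork and should be treated as assumptions, so please check the fork's README for your version. The values are illustrative, not our exact settings.

```python
# Sketch of the kind of bucket tuning described above. The variable names below
# are ASSUMED from the HabanaAI vLLM fork's bucketing documentation and may
# differ in your version; the values are illustrative.
import os

# Raise the largest pre-compiled prompt bucket so that an nK-token prompt never
# falls outside the bucket range and triggers an on-the-fly graph compilation.
os.environ["VLLM_PROMPT_SEQ_BUCKET_MAX"] = "8192"     # assumed variable name

# Likewise for decode: cover the largest possible KV-cache length
# (nK prompt tokens + 1K generated tokens), expressed in KV-cache blocks.
os.environ["VLLM_DECODE_BLOCK_BUCKET_MAX"] = "4096"   # assumed variable name
```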
Dynamic Sonnet Dataset
The fixed dataset benchmark may inherently favor Gaudi-2, as its relatively static workload aligns well with the strengths of the graph compiler. To provide a more balanced comparison, we conducted the same benchmarks using the dynamic dataset, which features varying input and output lengths. Surprisingly, even under these dynamic conditions, Gaudi-2 consistently outperformed the NVIDIA A100 across all Dynamic Sonnet datasets, ranging from 1K to 8K input tokens.
The performance gap in this case was narrower than in the fixed dataset benchmark. This difference can be attributed to the reduced prefill workload in dynamic datasets, as fewer tokens require prefill processing. Gaudi-2 continues to excel in the prefill stage compared to the decode stage, although there is still room for improvement in handling dynamic workloads. The reduced workload during the prefill phase diminishes the performance advantage provided by Gaudi-2’s optimized attention kernel for prefill, which avoids the costly index_select operation. Despite these nuances, Gaudi-2 demonstrated robust support for dynamic workloads, proving itself to be a viable option for LLM serving in real-world deployments.
Improved Attention Kernel
As discussed, the two primary features driving the significant performance improvements for Gaudi-2 in SynapseAI 1.19 are Contiguous PA and Pipelined PA. Both features target the attention kernel, which has historically been a challenge for Intel Gaudi due to its reliance on pre-compiled computation graphs. To better understand the source of these improvements, we benchmarked the attention kernel’s throughput in FLOPS (floating-point operations per second) for SynapseAI versions 1.18 and 1.19.
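Concretely, the kernel throughput can be estimated by counting the matmul FLOPs analytically and dividing by the measured kernel latency. The sketch below does this for single-query decode attention; the shapes and the latency value are purely illustrative.

```python
# Sketch of the throughput estimate: analytical FLOP count of the two matmuls
# (QK^T and softmax(QK^T) @ V) in single-query decode attention, divided by the
# measured kernel latency. Shapes and latency below are illustrative only.
def decode_attention_flops(batch: int, heads: int, kv_len: int, head_dim: int) -> float:
    qk = 2 * batch * heads * kv_len * head_dim   # QK^T: one query token per sequence
    pv = 2 * batch * heads * kv_len * head_dim   # softmax(QK^T) @ V
    return qk + pv

flops = decode_attention_flops(batch=256, heads=32, kv_len=4096, head_dim=128)
kernel_latency_s = 1.1e-3                        # placeholder: measured kernel latency
print(f"{flops / kernel_latency_s / 1e12:.2f} TFLOPS")
```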
The results were striking—across most scenarios, the new attention kernel in SynapseAI 1.19 significantly outperformed its predecessor from version 1.18. The performance gains were especially pronounced as batch size or sequence length increased, where attention kernels dominate the overall computation time.
Memory Bandwidth Utilization
The FLOPS measurements confirmed that the new attention kernel is a significant improvement over its predecessor. However, does this represent the peak performance—or roofline—for Gaudi-2? To investigate further, we conducted a comparative analysis against the NVIDIA A100 to assess whether Gaudi-2 has reached its full potential in terms of memory bandwidth.
The auto-regressive nature of the decode phase in LLM inference inherently exerts significant pressure on memory bandwidth, creating a substantial memory bottleneck. In this context, effective memory utilization becomes a critical metric for comparing different devices, as it reflects how well the kernel and hardware are optimized for the decode phase. While directly measuring memory bandwidth utilization is difficult, it can be estimated by dividing the amount of memory read per generation step by the latency of that step.
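The sketch below spells out that estimate for one decode step of Llama-3.1-8B in BF16: the bytes that must be read per step (the model weights once, plus the KV cache of every running request) divided by the measured TPOT. The batch size, context length, and latency here are placeholders, not our measured values.

```python
# Effective-bandwidth estimate (sketch): bytes read per decode step divided by
# the measured per-step latency (TPOT). Model dimensions are Llama-3.1-8B in
# BF16 (32 layers, 8 KV heads, head_dim 128); batch, kv_len, and TPOT are
# placeholders rather than measured values.
def bytes_read_per_step(batch: int, kv_len: int) -> float:
    layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
    weight_bytes = 8e9 * dtype_bytes   # ~8B parameters, read once per forward pass
    kv_bytes = batch * layers * 2 * kv_heads * head_dim * kv_len * dtype_bytes
    return weight_bytes + kv_bytes

tpot_s = 0.05                          # placeholder: measured TPOT in seconds
effective_bw = bytes_read_per_step(batch=128, kv_len=4096) / tpot_s
print(f"effective bandwidth ~ {effective_bw / 1e12:.2f} TB/s")
```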
When comparing Intel Gaudi-2 and NVIDIA A100, memory bandwidth utilization becomes even more critical due to differences in memory size and bandwidth between the two devices. Figures 5 and 6 indicate that Intel Gaudi-2 still lags behind the NVIDIA A100 in bandwidth utilization. This trend is even more pronounced with the Dynamic Sonnet dataset, highlighting the additional challenges Intel Gaudi-2 faces in handling dynamic workloads. Gaudi-2’s competitive performance therefore appears to be largely attributable to its higher raw memory bandwidth: despite lower utilization, the device still achieves a competitive level of effective memory bandwidth overall. At the same time, this shows that further optimization opportunities remain, particularly for memory-intensive computations like those in the attention kernel.
Final Thoughts
Since beginning our evaluation and optimization of vLLM for Gaudi-2 in May 2024, we have witnessed remarkable performance improvements. With the release of SynapseAI v1.19, we’re excited to report that Intel Gaudi-2 has achieved competitive performance in LLM inference compared to the NVIDIA A100 with vLLM. While there are still areas for improvement—such as advanced features like chunked prefill, splitwise processing, multi-modality support, and more—the foundational capabilities of the device are now well established.
At SqueezeBits, we are committed to pushing the boundaries of Intel Gaudi devices, enhancing both basic performance and advanced features. Looking ahead, our next post will focus on FP8 LLM inference natively supported by Intel Gaudi-2, exploring both model performance and serving efficiency.
If you’re interested in conducting your own comparisons, check out our LLM serving benchmark tool, Fits on Chips! Specifically designed for benchmarking LLMs, this toolkit enables precise configuration adjustments across various frameworks, streamlining the benchmarking process while providing detailed insights. With Fits on Chips, you can fine-tune settings and visualize their impact on performance. We’re also working on adding support for vLLM on both Gaudi-2 and Gaudi-3, making it easier than ever to compare devices and frameworks. Learn more about Fits on Chips here:
Stay tuned for more insights into the LLM serving capabilities of Intel Gaudi Series!