[Intel Gaudi] #2. Graph Compiler and Overall Performance Evaluation
In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Dec 02, 2024
Contents
- Disclaimer
- Introduction
- Graph Compiler for Gaudi
- Execution Modes
- Bucketing: Supporting Dynamic Shapes
- Experiment Setup
- Benchmark Dataset
- Software and Hardware Setup
- Results
- Overhead of Graph Compilation and Bucketing: Warm-up time and memory usage
- Overall Performance: Throughput vs. Time-Per-Output-Token
- What’s Next?

Disclaimer
This blog series is being written independently, without any input or influence from Intel. Our objective is to provide an unbiased evaluation by discussing both the strengths and the limitations of the Gaudi-2 accelerator as a third-party user. Really.
Introduction
In our previous post, we introduced the basics of the Intel Gaudi series. For those who missed it, Gaudi is Intel's alternative accelerator for AI workloads, engineered to provide a cost-effective solution for both training and inference of AI models. Its hardware design incorporates unique features, including large matrix multiplication engines that improve memory bandwidth utilization and native support for FP8 computation. Complementing the hardware, the software stack delivers comprehensive support for training and inference through the PyTorch library.
As highlighted earlier, a cornerstone of the Gaudi series’ software stack is its Graph Compiler. This tool mandates the pre-compilation of PyTorch operations into a computation graph, encompassing detailed specifications of input tensors, computational workflows, and output tensor handling to achieve the best performance. However, working with the Graph Compiler poses a significant challenge: it imposes unique requirements that must be met to handle dynamic workloads efficiently.
In this post, we will explore the intricacies of the Graph Compiler and its impact on LLM serving frameworks. While we cannot dive deeply into the details of the Graph Compiler—since it is proprietary software and we do not have visibility into its actual implementation—we will focus on the benefits and limitations it introduces for AI model inference. Additionally, we will compare the serving metrics of vLLM on Gaudi-2 with those of NVIDIA A100, providing a clear view of where Gaudi-2 excels and where it faces challenges.
Graph Compiler for Gaudi
The Graph Compiler serves as the central component of Intel Gaudi's software stack, playing a pivotal role in translating high-level PyTorch models into optimized binary code ready for execution. The inference process for AI models on the Intel Gaudi series diverges significantly from inference on GPUs, primarily due to the requirement for pre-compiling computational graphs. While transitioning to the Intel Gaudi series is often straightforward, thanks to its extensive support for diverse PyTorch operations and open-source projects, achieving optimal performance requires a thorough understanding of the Graph Compiler.
As highlighted in our earlier post, the Intel Gaudi series takes a distinctive approach by offloading critical responsibilities, such as memory management and graph optimization, to the Graph Compiler, which runs on the host. This design choice makes it possible to leverage graph-level optimizations, such as kernel fusion and parallelization, which are standard features in modern graph compilers for AI models.
Graph-level optimizations can yield substantial performance improvements by reducing the host latency associated with kernel launches and memory write-backs. On NVIDIA GPUs, similar optimizations are performed by the TensorRT framework, which now incorporates its proprietary Myelin backend. The Graph Compiler for the Intel Gaudi series offers comparable functionality, further enhancing computational performance.
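Although Gaudi's Graph Compiler is proprietary, the effect of graph-level optimization is easy to picture with a plain PyTorch analogue. The sketch below is illustrative only and contains nothing Gaudi-specific: it simply contrasts eager, op-by-op dispatch with a captured and fused graph via `torch.compile`.

```python
import torch

def gelu_mlp(x, w1, w2):
    # Three logical ops (matmul, GELU, matmul). A graph compiler can fuse them,
    # cutting kernel launches and intermediate memory write-backs.
    return torch.nn.functional.gelu(x @ w1) @ w2

x, w1, w2 = torch.randn(32, 1024), torch.randn(1024, 4096), torch.randn(4096, 1024)

eager_out = gelu_mlp(x, w1, w2)        # eager: one kernel launch per op
fused_fn = torch.compile(gelu_mlp)     # graph capture + fusion (conceptually similar)
fused_out = fused_fn(x, w1, w2)

print(torch.allclose(eager_out, fused_out, atol=1e-4))  # same math, fewer launches
```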
In addition to graph-level optimization, the Intel Gaudi series also delegates memory management and data layout planning to the Graph Compiler. This design decision imposes a strict requirement for pre-compiling computational graphs, as the device cannot otherwise determine where to retrieve input tensors or store output tensors for computation. While this requirement may appear restrictive, it enables substantial improvements in memory bandwidth utilization by optimizing memory access patterns in advance of execution.
Because memory management is entirely handled by the Graph Compiler, the Gaudi series mandates pre-compilation of all computation graphs. This approach marks a notable contrast to NVIDIA GPUs, where each PyTorch operator can be launched eagerly without graph compilation. To accommodate this requirement, the Gaudi series introduces its own execution modes for handling computation operations.
Execution Modes
Computation graph execution on Gaudi is supported in two primary modes: Eager Mode and Lazy Mode, as illustrated in Figure 2.
- Eager Mode: This mode launches the given computation immediately. Eager Mode can incur substantial overhead due to frequent kernel launches when every op is dispatched individually. However, it is expected to support torch-compiled computation graphs via `torch.compile` in future updates.
- Lazy Mode: In this mode, the host accumulates PyTorch operations, compiles a batch of operations into a single computation graph, and then launches it. By allowing the host to aggregate multiple operations, Lazy Mode enables graph-level and memory optimizations, significantly boosting performance. However, even when computation graphs are pre-compiled, Lazy Mode introduces host overhead because operations must be accumulated and a graph hash generated to locate the corresponding recipe. To address this, Lazy Mode can be combined with the HPU Graph option, which allows explicit replay of pre-compiled computation graphs and thereby reduces the overhead (see the sketch after this list).
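For concreteness, here is a minimal sketch of what Lazy Mode and HPU Graphs look like from user code, assuming the module paths documented for the public SynapseAI PyTorch bridge (`habana_frameworks.torch`); exact names can vary across SDK versions.

```python
import torch
import habana_frameworks.torch.core as htcore          # Habana PyTorch bridge (Lazy Mode)
import habana_frameworks.torch.hpu.graphs as htgraphs  # HPU Graph utilities

model = torch.nn.Linear(4096, 4096).to("hpu")
x = torch.randn(8, 4096, device="hpu")

# Lazy Mode: operations are only accumulated here ...
y = model(x)
# ... and compiled into (or matched against) a cached graph when the step is marked.
htcore.mark_step()

# Optionally wrap the module so the pre-compiled graph is replayed directly,
# skipping op accumulation and graph hashing on subsequent calls.
model = htgraphs.wrap_in_hpu_graph(model)
y = model(x)
```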
Currently, Lazy Mode is the standard execution mode for the Gaudi series, while support for `torch.compile` is under active development. To minimize the host overhead caused by repeated compilation, computation graphs are hashed based on the static information of their inputs, such as tensor shapes and data types, as well as the structure of the graph itself (i.e., the PyTorch ops and their connections). The Graph Compiler caches these pre-compiled graphs and reuses them whenever possible to reduce host overhead.
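Since the real recipe cache is internal to the Graph Compiler, the following is only a toy stand-in to make the caching behavior concrete: a recipe is looked up by a key built from the graph structure plus the static metadata (shapes and dtypes) of its inputs; `compile_graph` here is a hypothetical placeholder for the expensive compilation step.

```python
import torch

recipe_cache = {}

def compile_graph(graph_signature, inputs):
    # Placeholder for the (proprietary, expensive) compilation step.
    return f"recipe for {graph_signature} @ {[tuple(t.shape) for t in inputs]}"

def get_recipe(graph_signature, inputs):
    key = (
        graph_signature,                                        # ops and their connections
        tuple((tuple(t.shape), str(t.dtype)) for t in inputs),  # static input metadata
    )
    if key not in recipe_cache:          # cache miss: compile once, reuse afterwards
        recipe_cache[key] = compile_graph(graph_signature, inputs)
    return recipe_cache[key]

x = torch.randn(8, 4096)
get_recipe("linear->gelu->linear", [x])   # first call compiles
get_recipe("linear->gelu->linear", [x])   # cache hit: same graph, shape, and dtype
```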
Bucketing: Supporting Dynamic Shapes
Modern AI workloads, particularly those involving LLMs, often require efficient handling of dynamically shaped input tensors. This challenge arises because LLMs are typically auto-regressive, making it difficult to predict both the input length provided by users and the length of the final output. As discussed earlier, Gaudi requires pre-compiled computation graphs, which are hashed based not only on the PyTorch operations but also on the shapes of their input and output tensors. Consequently, specialized handling is needed for Gaudi to support computations involving dynamic shapes.
For Gaudi, the officially recommended approach to managing dynamic shapes is to employ bucketing, as illustrated in Figure 3. Users predefine “buckets” based on expected input tensor shapes and zero-pad each incoming input so that its shape matches the smallest bucket that can hold it. This method eliminates the need for costly on-the-fly computation graph compilation, provided the input shape aligns with a known bucket. However, this approach comes with a trade-off: wasted computation due to zero-padding. The expectation is that the reduction in compilation latency outweighs the overhead caused by unused computations.
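A minimal sketch of sequence-length bucketing with zero-padding (the bucket boundaries below are arbitrary examples, not the values vLLM's Gaudi backend actually uses):

```python
import torch

SEQ_BUCKETS = [128, 256, 512, 1024, 2048]   # illustrative bucket boundaries

def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int = 0) -> torch.Tensor:
    """Zero-pad a (batch, seq_len) tensor up to the smallest bucket that fits it."""
    seq_len = input_ids.shape[1]
    bucket = next(b for b in SEQ_BUCKETS if b >= seq_len)  # raises if no bucket is large enough
    padded = torch.full((input_ids.shape[0], bucket), pad_token_id, dtype=input_ids.dtype)
    padded[:, :seq_len] = input_ids
    return padded

batch = torch.randint(1, 32000, (4, 300))
print(pad_to_bucket(batch).shape)   # torch.Size([4, 512]) -> reuses the 512-token recipe
```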
The vLLM implementation for Gaudi adopts the same bucketing strategy to manage dynamic input shapes, covering both the batch dimension and the sequence-length dimension. However, this is also where performance struggles arise compared to vLLM running on an NVIDIA A100. To better understand these differences, let’s dive deeper into performance experiments and comparisons.
Experiment Setup
We conducted experiments to evaluate the strengths and weaknesses of Gaudi-2 compared to its market competitor, the NVIDIA A100, in serving LLMs. Our analysis focused on measuring throughput and time-per-output-token (TPOT) across varying sequence lengths to identify scenarios where Gaudi-2 outperforms and where it falls short.
Benchmark Dataset
For all experiments, we used a custom-curated dataset, Dynamic Sonnet, specifically designed to evaluate how LLM serving systems handle dynamic inputs. Each subset is labeled nK, where "n" represents the maximum input length in multiples of 1024 tokens. The output length is determined dynamically, with generation stopping either upon encountering an end-of-sequence token or when reaching the maximum limit of 1024 tokens. The experiments were conducted with 1024 requests, and the maximum batch size was set to 256.
Software and Hardware Setup
- Framework: vLLM v0.6.4
- Gaudi SDK: SynapseAI v1.18.0
- Model: LLaMA-3.1-8B-Instruct (BF16)
- Hardware: NVIDIA A100-PCIe 80G GPU, Gaudi-2
We made efforts to maintain consistent hardware and software settings across all experiments. However, please note that minor deviations may exist due to differences in host CPUs and the specific vLLM commit used for each device.
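For reference, the offline equivalent of this setup in vLLM's Python API looks roughly like the sketch below. The exact engine arguments and their defaults differ between the Gaudi fork of vLLM and the upstream CUDA build, and the prompts are placeholders for the Dynamic Sonnet requests.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    max_model_len=2048,     # 1K-8K depending on the experiment
    max_num_seqs=256,       # maximum batch size used in all runs
)

prompts = ["Shall I compare thee to a summer's day?"] * 1024  # stand-ins for the 1024 requests
params = SamplingParams(max_tokens=1024)   # generation stops at EOS or 1024 output tokens
outputs = llm.generate(prompts, params)
```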
Results
Overhead of Graph Compilation and Bucketing: Warm-up time and memory usage
To assess the overhead of graph compilation, we measured the warm-up time and the additional memory required to store pre-compiled graphs on both the A100 (with CUDA Graphs enabled) and Gaudi-2. The maximum batch size was set to 256, with maximum model lengths of 1K, 2K, 4K, and 8K tokens. To ensure a fair comparison, the warm-up tuning parameters for Gaudi were adjusted to prevent any on-the-fly compilation, even when request preemption occurs.
The results revealed some interesting trends. As depicted in Table 1, Gaudi-2 required significantly less memory to store computation graphs for short sequences. However, as the maximum input length increased, its memory requirements grew substantially compared to the A100. Additionally, the warm-up time for Gaudi-2 was considerably higher than that of the A100. For example, launching an OpenAI API-compatible server on the NVIDIA A100 consistently used 1.65 GiB of additional memory for CUDA graphs, which took approximately 11 seconds to trace for all maximum input lengths. In contrast, Gaudi-2 used between 0.1 GiB and 7.1 GiB of memory to store computation graphs, with tracing times ranging from 96 seconds to 531 seconds.
The memory differences appear to stem from how each platform handles graph compilation. On GPUs, vLLM currently employs CUDA Graphs only for the decode phase, as larger batches gain less benefit from pre-compiled graphs. In contrast, Gaudi-2 depends on the Graph Compiler and must therefore pre-compile computation graphs for both the prefill and decode phases, covering every predefined bucket, which results in substantial memory usage for storing these graphs. Consequently, serving models with long sequences on Gaudi-2 becomes more complex, as careful memory optimization is crucial to manage resources efficiently.
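To get a feel for why the number of pre-compiled graphs (and therefore warm-up time and graph memory) grows with the maximum length, it helps to count the bucket combinations. The step sizes below are purely illustrative rather than the actual values used by the Gaudi vLLM backend.

```python
# Count how many (batch, sequence/block) shape combinations must be pre-compiled.
def count_graphs(max_batch: int, max_len: int, bs_step: int = 32, seq_step: int = 128) -> int:
    batch_buckets = range(bs_step, max_batch + 1, bs_step)
    seq_buckets = range(seq_step, max_len + 1, seq_step)
    # Prefill and decode each need their own set of graphs on Gaudi.
    return 2 * len(batch_buckets) * len(seq_buckets)

for max_len in (1024, 2048, 4096, 8192):
    print(max_len, count_graphs(max_batch=256, max_len=max_len))
# The grid, and with it warm-up time and graph memory, grows roughly linearly with
# the maximum length, whereas CUDA Graph capture covers the decode phase only.
```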
Warm-up time is also a critical factor in LLM serving scenarios, where servers are often reallocated dynamically based on live traffic. In this area, Gaudi-2 lags significantly behind, requiring multiple warm-up iterations. While some workarounds, such as caching computation graphs to disk, can help reduce warm-up time, further optimization is still needed to make Gaudi-2 more competitive in dynamic serving environments.
Overall Performance: Throughput vs. Time-Per-Output-Token
As we discussed in an earlier post, Gaudi-2 delivers performance comparable to the NVIDIA A100 in short-sequence scenarios. Using the Dynamic Sonnet 1K subset (up to 1K input tokens), we observed that Gaudi-2 achieved higher throughput than the A100 at smaller batch sizes, while also delivering lower TPOT. This performance advantage can be attributed to several combined factors, but the most significant contributor is Gaudi-2’s higher matrix multiplication throughput, as the linear layers heavily influence overall performance in short-sequence scenarios.
To validate Gaudi-2’s advantage in matrix multiplication performance, we measured prefill throughput by setting the maximum output length to 1 token. The results demonstrated that Gaudi-2 significantly outperformed the A100 during the prefill phase, where matrix multiplication is the primary driver of performance.
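A small sketch of how such a prefill-only measurement can be set up with vLLM (reusing the configuration from the earlier sketch; the prompts are again placeholders):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16", max_num_seqs=256)
prompts = ["Shall I compare thee to a summer's day?"] * 1024   # placeholder requests

start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=1))   # cap decode at one token
elapsed = time.perf_counter() - start

prompt_tokens = sum(len(o.prompt_token_ids) for o in outputs)
print(f"prefill throughput ~ {prompt_tokens / elapsed:.1f} tokens/s")
```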
However, despite Gaudi-2's matrix multiplication advantage, an important observation remains: the A100 outperformed Gaudi-2 on the short-sequence datasets at larger batch sizes, where TPOT is longer. This result seems counterintuitive, as superior computational power should typically translate into better throughput.
To further evaluate whether Gaudi-2 can compete with the A100 in other scenarios, we extended the experiments by increasing the maximum model length to 3K and 5K and testing throughput and TPOT using datasets with 2K and 4K input tokens. As shown in Figure 6, Gaudi-2 exhibited performance degradation as the input sequence length increased.
Interestingly, as shown in Figure 5, Gaudi-2 maintained higher prefill throughput than the A100 for the 2K and 4K datasets. This suggests that the performance degradation observed in Gaudi-2’s serving capability is occurring primarily during the decode phase. Moreover, this degradation worsens as the input sequence length grows, likely due to the increasing computational burden of attention layers relative to the linear layers.
Overall, the benchmarks suggest that Gaudi-2 can be competitive in specific use cases, particularly with short sequence inputs and smaller batch sizes. However, the device still requires further optimizations to realize its full potential. Benchmarks with longer datasets revealed significant performance degradation for Gaudi-2 on longer sequences, likely attributable to the attention kernel, as prefill throughput remained consistently high.
This performance issue can be partly explained by the mismatched behavior between the PagedAttention mechanism in vLLM and the Graph Compiler on Gaudi. In fact, much of the optimization effort for vLLM on Gaudi-2 has focused on implementing a PagedAttention kernel tailored to Gaudi-2’s architecture. We will discuss this topic further in our next post.
What’s Next?
In this post, we explored the Graph Compiler, a fundamental component of the Intel Gaudi series. Gaining a deep understanding of how the Graph Compiler operates and its implications for AI model optimization is essential for maximizing the performance of Gaudi devices, particularly in LLM serving applications. Our benchmarks revealed that Gaudi-2 can deliver performance on par with its competitor, the NVIDIA A100, in specific scenarios, though there is still a need for further improvements to achieve consistent high performance.
In the next post of this blog series, we will dive into the attention kernels for Gaudi-2, which currently represent a bottleneck for serving LLMs effectively on the device. We will share the journey behind the implementation of the Gaudi-optimized PagedAttention, which has significantly boosted vLLM performance for Gaudi and is still undergoing active development. In future posts, we’ll also discuss our experiences in developing advanced features like LoRA for Gaudi-2 to enhance LLM serving.
Additionally, don’t miss out on our LLM serving benchmark tool, Fits on Chips! Designed specifically for benchmarking LLMs, this toolkit allows precise configuration adjustments for various frameworks, making benchmarking both efficient and insightful. With Fits on Chips, you can fine-tune settings and visualize their performance impact. We’re also planning to include support for vLLM on both Gaudi-2 and Gaudi-3 as benchmark candidates, enabling seamless comparisons across devices and frameworks. If you’re interested, learn more about Fits on Chips here:
Stay tuned for more insights into the LLM serving capabilities of Intel Gaudi Series!