[Intel Gaudi] #1. Introduction

In this blog series, we thoroughly evaluate Intel's AI accelerator, the Gaudi series, focusing on its performance, features, and usability.
Taesu Kim
Nov 21, 2024

Disclaimer

⚠️
This blog series is being written independently, without any input or influence from Intel. Our objective is to provide an unbiased evaluation by discussing both the strengths and the limitations of the Gaudi-2 accelerator as a third-party user. Really.

Introduction

At SqueezeBits, our mission is to make AI both efficient and accessible. Achieving efficiency in AI typically involves two primary approaches: optimizing models or leveraging advanced hardware to accelerate workloads. Our approach focuses on hardware-software co-optimization, applying techniques such as quantization and pruning in ways that align with the hardware's capabilities. In addition, we develop device-specific kernels and optimizations to accelerate AI models, fully leveraging the specifications of the target hardware.
This strategy has become particularly critical in the era of Large Language Models (LLMs). As LLM-based services advance, the demand for performance optimization has grown, driven by the need to reduce serving costs. At the same time, preserving the model unchanged remains essential for the safety and reliability of the service, which makes maintaining model integrity a parallel challenge. Addressing these challenges requires two key initiatives: improving serving frameworks to handle dynamic workloads more efficiently, and leveraging specialized hardware designed to meet the unique requirements of LLMs.
Among the noteworthy hardware solutions for LLM serving is Intel’s Gaudi series, purpose-built accelerators for AI workloads. The Gaudi-2 accelerator has been accessible through the Intel Developer Cloud for some time, and the recently launched Gaudi-3 marks Intel’s next step in the AI accelerator market. Gaudi-3 is now also available via Intel Developer Cloud, and IBM has announced its first on-premise cloud deployment of this cutting-edge hardware.
We’ve been deeply involved in evaluating Gaudi-2 and developing an LLM serving framework for it. Our journey has taken us from a basic implementation of vLLM for Gaudi-2 to its current state, in which we have observed throughput improvements of several orders of magnitude. With support for the Gaudi series recently integrated into the main vLLM repository, we believe now is the perfect time to share our experiences with Gaudi-2 and its performance in LLM inference.
In this upcoming blog series, we will present in-depth benchmarks of vLLM on Gaudi-2 and share our hands-on experience implementing specific features on this accelerator. We hope this series offers valuable insights to those interested in exploring Gaudi-2, providing a glimpse into its performance and the user experience it delivers. In this post, we’ll focus on the fundamentals of Gaudi-2 and examine its overall LLM serving performance using vLLM.

Intel Gaudi-2 Hardware

Figure 1. Gaudi-2 Processor High-level Architecture (source)

Overall Hardware Design

The Intel Gaudi-2 is a dedicated AI accelerator designed to handle both AI training and inference tasks. It features two Matrix Multiplication Engines (MMEs) and 24 Tensor Processor Cores (TPCs): the MMEs accelerate matrix multiplications, while the TPCs handle non-matrix operations. Each Gaudi-2 card comes equipped with 96GB of HBM2E memory, offering more capacity and higher memory bandwidth (up to 2.46TB/s) than competitors such as NVIDIA's A100-80GB. With its dedicated MMEs for matrix operations and superior memory bandwidth, Gaudi-2 achieves peak performance exceeding 400 TFLOPS in BF16 precision.
Table 1. Compute and Memory specs of NVIDIA A100, H100, Intel Gaudi-2 and Gaudi-3.
Gaudi-2 also incorporates a networking architecture built on RDMA over Converged Ethernet (RoCEv2). The accelerator integrates 24 Ethernet ports capable of 100 Gbps communication, with 21 ports reserved for all-to-all connectivity between Gaudi-2 cards in a single node. While it lacks a specialized communication solution like NVIDIA’s NVLink, Gaudi-2’s networking design delivers sufficient bandwidth for features such as Tensor Parallelism during LLM inference. Additionally, its reliance on cost-efficient standard Ethernet hardware makes it an attractive option for scaling AI workloads.
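To make the Tensor Parallelism point concrete, below is a minimal sketch of how tensor parallelism is typically enabled in vLLM so that a model is sharded across the cards of a single node, with the resulting all-reduce traffic carried over the intra-node RoCE links described above. The model name and the assumption of an 8-card Gaudi-2 node are illustrative choices, not a prescription.

```python
# Minimal sketch: shard a model across the cards of one node with vLLM's
# tensor parallelism. The all-reduce traffic generated by tensor parallelism
# travels over the intra-node interconnect (RoCE on Gaudi-2, NVLink on NVIDIA).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model choice
    tensor_parallel_size=8,                     # one shard per card in an 8-card node
)

outputs = llm.generate(
    ["Explain RDMA over Converged Ethernet in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```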

Interesting Design Choices

A standout feature of the Gaudi-2 is its support for the FP8 data format, covering both the E4M3 and E5M2 variants. This capability enhances performance for both AI training and inference, with Gaudi-2 achieving a peak performance of over 800 TFLOPS in FP8 precision. Despite some limitations, such as the restricted range of E4M3 (addressed in Gaudi-3), the inclusion of FP8 support gives Gaudi-2 a distinct advantage over competitors like the A100, which lacks hardware acceleration for FP8.
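For readers unfamiliar with the two FP8 variants, the short snippet below inspects their numeric properties using PyTorch's dtype metadata. Note that these are the standard software-side FP8 ranges; as mentioned above, Gaudi-2's hardware E4M3 covers a narrower range, a limitation lifted in Gaudi-3.

```python
# Quick look at the two FP8 variants via PyTorch's dtype metadata.
# E4M3 trades exponent range for precision (more mantissa bits), while
# E5M2 covers a wider dynamic range with coarser precision.
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")
```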
Another notable aspect of the Gaudi-2’s design is the absence of a traditional memory hierarchy. Unlike NVIDIA GPUs, which feature multiple layers of memory hierarchy (L2-cache, L1-cache, shared memory, and registers), Gaudi-2 employs a simpler approach with 48MB of local SRAM. This SRAM is shared across MMEs, TPCs, DMAs, and other components for data pipelining. This design choice is made possible by Gaudi-2’s larger MMEs, which support bigger matrices compared to GPU counterparts. The larger MMEs allow for more efficient data reuse, reducing the need for extensive memory bandwidth during computations.
While this simplification alleviates the complexity of managing memory hierarchies for users, it shifts the responsibility to the compiler. The compiler must tightly schedule memory communications and computations before workloads are executed. To address this, Intel provides the SynapseAI Software Suite, a comprehensive toolkit designed to optimize Gaudi-series hardware for AI workloads, ensuring efficient execution and performance.

Software Stack for Intel Gaudi Series

Figure 2. Intel Gaudi Software Suite (source)

SynapseAI Software Suite

The SynapseAI Software Suite provides a comprehensive software stack for the Gaudi series, designed to efficiently map neural network architectures onto Gaudi hardware. This suite includes several components, such as the graph compiler, runtime, TPC kernel library, firmware and drivers, as well as developer tools like the TPC SDK and Profiler. It is built to support popular machine learning libraries, particularly PyTorch, while requiring minimal intervention from users.
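In practice, "minimal intervention" looks roughly like the sketch below: import the Habana PyTorch bridge, move the model to the "hpu" device, and mark a step so the lazily recorded graph is handed to the graph compiler. The module paths follow the documented habana_frameworks PyTorch integration and may vary slightly between SynapseAI releases.

```python
# Minimal sketch of running a PyTorch module on Gaudi through the SynapseAI
# PyTorch bridge. Importing habana_frameworks.torch.core registers the "hpu"
# device; mark_step() tells the lazy-mode runtime to flush the accumulated
# graph to the graph compiler for execution.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")

model = torch.nn.Linear(1024, 1024).to(device).to(torch.bfloat16)
x = torch.randn(8, 1024, dtype=torch.bfloat16, device=device)

y = model(x)        # ops are recorded lazily instead of executing eagerly
htcore.mark_step()  # flush the recorded graph for compilation and execution

print(y.shape)
```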

Graph Compiler and the PyTorch Runtime

At the core of SynapseAI’s functionality are the graph compiler and the PyTorch runtime, which translate AI models developed in PyTorch into optimized instructions for Gaudi hardware. As previously mentioned, Gaudi-2 lacks a complex memory hierarchy and instead depends entirely on the graph compiler to pre-compute memory addresses for data. This necessitates pre-compiling each workload before execution. The graph compiler, in conjunction with the PyTorch runtime, automatically aggregates PyTorch operators into computational graphs and compiles these graphs into hardware-optimized executables. Furthermore, the graph compiler applies various optimization techniques, such as operator fusion, to boost computational efficiency and enhance performance.
While this approach enables significant acceleration, it also comes with certain drawbacks. The requirement for compilation before every workload introduces host latency, potentially impacting performance. To mitigate this, the runtime uses hash-based matching for compiled graphs, reducing the frequency of recompilation. However, effective use still requires careful planning during the warm-up stage for pre-compilation and the allocation of dedicated device memory to store computational graphs. Additionally, the proprietary nature of the graph compiler and PyTorch runtime makes user-driven optimization more challenging, as much of the process operates as a “black box.”
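The warm-up stage mentioned above can be sketched as follows: because each new tensor shape triggers a fresh graph compilation, serving code typically runs dummy forward passes over the (batch size, sequence length) combinations it expects, so that the compiled graphs are already cached when real traffic arrives. The shapes and model below are purely illustrative.

```python
# Hedged sketch of a warm-up loop that pre-compiles graphs for the shapes we
# expect to see in production, so real requests hit the graph cache instead
# of paying compilation latency.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(4096, 4096).to(device).to(torch.bfloat16)

warmup_shapes = [(1, 128), (4, 128), (16, 512), (64, 1024)]  # (batch, seq_len)

with torch.no_grad():
    for batch_size, seq_len in warmup_shapes:
        dummy = torch.zeros(batch_size, seq_len, 4096,
                            dtype=torch.bfloat16, device=device)
        model(dummy)
        htcore.mark_step()  # compile and cache the graph for this shape
```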
To address this limitation, future updates plan to incorporate support for the torch.compile function. This enhancement would give users greater transparency into the compilation process and provide tools to intervene in how computations are executed, empowering users to fine-tune their workloads further.
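For reference, opting into torch.compile on Gaudi is expected to look roughly like the sketch below. The backend name "hpu_backend" follows Intel's Gaudi PyTorch documentation at the time of writing; treat it as an assumption that may change across software releases.

```python
# Hedged sketch of using torch.compile with the Gaudi backend.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device/backend

model = torch.nn.Linear(1024, 1024).to("hpu")
compiled = torch.compile(model, backend="hpu_backend")  # backend name per Gaudi docs

x = torch.randn(8, 1024, device="hpu")
print(compiled(x).shape)
```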

Custom Kernel Support with TPC

The TPC kernel library and SDK are also key components of the SynapseAI Software Suite, offering users the flexibility to implement custom kernels for improved workload optimization. Using the TPC kernel library, developers can create custom computation logic that leverages the accelerator’s TPCs. The library adopts a C-like coding structure for ease of use and provides access to both HBM and local SRAM, along with a variety of computation primitives.
However, one notable limitation is the absence of primitives for the MMEs within the TPC kernel library. As a result, custom kernels cannot directly accelerate matrix multiplication using the MMEs. While this simplifies kernel design overall, it prevents users from explicitly implementing advanced optimizations, such as overlapping TPC and MME computations. Nevertheless, the TPC SDK provides an interface for users to guide the graph compiler in overlapping TPC kernel operations with MME computations, enabling some degree of optimization for complex workloads.

LLM Serving Framework for Intel Gaudi Series

With the SynapseAI Software Suite, PyTorch-based AI models have been successfully deployed on Gaudi-2. Among these deployments, supporting LLMs has emerged as the highest priority. LLMs are currently among the most prominent AI models, requiring both high memory bandwidth and computational capability, areas where Gaudi-2 excels. Several open-source projects have implemented LLM support for Gaudi-2, including DeepSpeed for Gaudi, Text Generation Inference (TGI) for Gaudi, and vLLM for Gaudi. Of these, vLLM for Gaudi has seen the most active development recently, as the vLLM framework is widely used across academia and industry for LLM serving.

vLLM for Intel Gaudi Series

vLLM is an open-source LLM serving framework designed around a specialized attention mechanism called PagedAttention, which improves computational throughput by increasing the number of batched requests. PagedAttention manages KV-caches at the page level using on-the-fly page allocation, similar to virtual memory paging in operating systems, thereby reducing the memory waste caused by pre-allocated KV-cache space. This reduction in memory requirements allows the LLM serving scheduler to batch more decode requests simultaneously, leading to better throughput. The trade-off for this throughput improvement, however, is more dynamic KV-cache management, as on-the-fly page allocation adds complexity. PagedAttention has been extensively optimized for GPUs, enabling them to handle this dynamism with minimal overhead.
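To make the bookkeeping behind PagedAttention concrete, here is a toy, framework-agnostic sketch: a global pool of fixed-size KV-cache blocks, a per-request block table, and a block allocated only when a request actually grows into it. All names are hypothetical and have no relation to vLLM's internals.

```python
# Toy sketch of PagedAttention-style KV-cache bookkeeping (names hypothetical).
BLOCK_SIZE = 16  # tokens stored per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # raises IndexError if the pool is exhausted

class Request:
    def __init__(self):
        self.num_tokens = 0
        self.block_table = []  # logical block index -> physical block id

    def append_token(self, allocator: BlockAllocator):
        if self.num_tokens % BLOCK_SIZE == 0:  # current block full (or first token)
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
req = Request()
for _ in range(40):        # 40 generated tokens -> ceil(40 / 16) = 3 blocks
    req.append_token(allocator)
print(req.block_table)     # pages may be non-contiguous in physical memory
```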
Adapting PagedAttention to Gaudi-2, however, has been significantly more challenging due to the device’s reliance on the graph compiler. Specifically, the need to compute memory addresses using the graph compiler introduced substantial overhead when gathering non-contiguous KV-cache pages. Additionally, the requirement to pre-compile computation graphs for every workload made it particularly difficult to support input sequences with dynamic batch sizes and lengths. Consequently, much of the optimization effort for vLLM on Gaudi-2 has focused on improving the PagedAttention implementation to suit the hardware’s unique characteristics.
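One common way to tame this recompilation problem, and a key idea in the Gaudi port of vLLM, is bucketing: instead of compiling a graph for every (batch size, sequence length) pair that arrives, both are rounded up to a small, fixed set of buckets so only a bounded number of graphs ever needs to be compiled. The sketch below illustrates the idea with made-up bucket values.

```python
# Hedged sketch of shape bucketing to bound the number of compiled graphs.
import math

BATCH_BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
SEQ_LEN_STEP = 128  # pad sequence length up to a multiple of 128

def bucketize(batch_size: int, seq_len: int) -> tuple[int, int]:
    padded_batch = next(b for b in BATCH_BUCKETS if b >= batch_size)
    padded_len = math.ceil(seq_len / SEQ_LEN_STEP) * SEQ_LEN_STEP
    return padded_batch, padded_len

print(bucketize(3, 200))    # -> (4, 256): pad and reuse the (4, 256) graph
print(bucketize(100, 900))  # -> (128, 1024)
```

The cost of bucketing is wasted computation on padding, which is part of why tuning the bucket granularity matters for end-to-end throughput.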

Overview of Current Performance

Over the past six months, substantial progress has been made, leading to throughput improvements of several orders of magnitude for vLLM on Gaudi-2. As a result, vLLM on Gaudi-2 has become competitive with the NVIDIA A100 for certain models and benchmark scenarios.
Figure 3. Time-Per-Output-Token (TPOT) vs. Throughput plot of Gaudi-2 and A100
For example, Figure 3 illustrates a benchmark comparison of vLLM running on Gaudi-2 versus an NVIDIA A100-SXM. Both devices ran vLLM v0.6.3 to ensure a fair comparison, and the Llama-3.1-8B-Instruct model was benchmarked on a single device of each type. The maximum input and output token lengths were set to 1,024. Input lengths varied according to the dynamic_sonnet_llama3 dataset, and output lengths were determined by the detection of an end-of-sequence token. The maximum batch size was set to 256.
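For readers who want to reproduce something similar, the snippet below is a hedged, offline approximation of the key knobs of this setup using the vLLM Python API: a single device, Llama-3.1-8B-Instruct, at most 256 sequences batched together, and up to 1,024 generated tokens per request. The actual Figure 3 numbers come from a client-server run with varying numbers of parallel requests, not from this offline loop.

```python
# Hedged offline approximation of the benchmark configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=256,     # maximum batch size used in the benchmark
    max_model_len=2048,   # room for up to 1,024 input + 1,024 output tokens
)

params = SamplingParams(max_tokens=1024)  # generation stops earlier on EOS
outputs = llm.generate(["Write a sonnet about accelerators."], params)
print(outputs[0].outputs[0].text)
```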
The benchmark explored the relationship between Time-Per-Output-Token (TPOT) and throughput under a varying number of parallel active requests. This analysis provides insight into the trade-off between user experience and serving cost. As illustrated in the figure, higher throughput comes at the cost of higher TPOT, which may negatively impact user experience. We observed that Gaudi-2 performs competitively with the A100 in low-TPOT scenarios. However, as TPOT increased, Gaudi-2's throughput began to degrade more significantly. Despite this, the benchmark results show that Gaudi-2 offers comparable performance in this use case, showcasing its potential as a viable alternative to the A100 for LLM serving.
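For clarity, the two axes of Figure 3 are typically derived from per-request timings along the lines of the sketch below; the function and field names are hypothetical and not tied to any particular benchmarking tool.

```python
# How TPOT and throughput are typically computed from per-request timings.
def tpot_seconds(first_token_time: float, finish_time: float,
                 num_output_tokens: int) -> float:
    # Average time per output token, excluding the first token (prefill latency).
    return (finish_time - first_token_time) / max(num_output_tokens - 1, 1)

def throughput_tokens_per_s(total_output_tokens: int,
                            wall_clock_seconds: float) -> float:
    # Aggregate output tokens generated per wall-clock second across all requests.
    return total_output_tokens / wall_clock_seconds

print(tpot_seconds(0.2, 10.2, 501))           # 0.02 s/token, i.e. 20 ms TPOT
print(throughput_tokens_per_s(50_000, 25.0))  # 2000 tokens/s
```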

What’s Next?

In this post, we explored the key technical aspects of the Intel Gaudi series, especially Gaudi-2, covering both its hardware and software characteristics. Over the past six months, LLM serving support on Gaudi-2 has undergone significant development, culminating in its inclusion in the main vLLM repository as an officially supported hardware backend. While our benchmarks demonstrated that serving LLMs with Gaudi-2 can achieve performance comparable to the NVIDIA A100, there still remains a noticeable performance gap in some cases.
In the upcoming posts of this blog series, we will provide in-depth benchmarks to analyze the current state of LLM serving on Gaudi-2 and discuss what the near future holds for this hardware. We will also share our experiences developing specific features for LLM serving on Gaudi-2, such as LoRA support.
Also, don’t miss our LLM serving benchmark tool Fits on Chips! This toolkit is designed specifically for benchmarking LLMs, offering precise configuration adjustments for various frameworks. With Fits on Chips, you can fine-tune settings and visualize their impact on performance, making the benchmarking process more efficient and informative. We will soon include support for vLLM on both Gaudi-2 and Gaudi-3 as benchmark candidates, allowing you to easily compare results across devices and frameworks. If you’re interested, learn more about Fits on Chips here:
Stay tuned for more insights into the LLM serving capabilities of Intel Gaudi Series!
 