[Intel Gaudi] #6. GEMM, Attention, vLLM on Gaudi

Explore how Intel’s new Gaudi-3 compares to Gaudi-2, NVIDIA A100, and H100. We analyze real-world GEMM efficiency, attention performance, and LLM serving results to uncover what truly matters for AI inference and training workloads.
Taesu Kim
Oct 28, 2025

Introduction

Over the past year, we have examined Intel’s Gaudi accelerators from multiple perspectives including its architecture, efficiency, and cost. We’ve shown that Gaudi-2 can stand as a serious alternative to NVIDIA’s A100, both in raw performance and cost-per-compute efficiency. The natural next question is: how does Gaudi-3 compare?
Gaudi-2 has already proven itself as a capable workhorse in the inference and training ecosystem. Gaudi-3 enters the stage with a bold step forward: dramatically higher compute, memory, and interconnect bandwidth, powered by Intel’s first chiplet-based architecture for AI accelerators.
Yet, as every engineer knows, raw specs rarely tell the full story. In this post, we compare Gaudi-2 and Gaudi-3 against their direct NVIDIA counterparts, A100 and H100, focusing on benchmarks that matter for ML engineers in real-world deployments:
  1. Practical matrix multiplication efficiency vs. theoretical FLOPS
  2. Attention performance and memory bandwidth utilization in transformer models
  3. Actual LLM serving performance
We also take a closer look at Gaudi-3’s behavior under heavy BF16 workloads and uncover evidence of thermal and power throttling as a key performance limiter. Finally, we discuss what these findings imply for large-scale GenAI inference using Gaudi hardware.

Architecture & Design: Gaudi-2 vs Gaudi-3

Monolithic vs Chiplet: Design Philosophy

Gaudi-2 employs a monolithic design: a single die integrating all compute, memory controllers, interconnect, and networking elements.
Gaudi-3, on the other hand, adopts a chiplet architecture composed of two identical compute dies connected via a high-speed interposer. The intent is to expose a unified compute and memory address space to software, allowing seamless scaling of compute density and HBM capacity beyond the limits of a single die.
This chiplet approach enables substantial scaling, but it also brings trade-offs. Managing heat dissipation across multiple dies and maintaining consistent power delivery become complex engineering challenges that directly affect sustained performance under load.
Key Specification Comparison

Below is a summary of the key specs that set the stage for performance expectations:

| Metric | Gaudi-2 | Gaudi-3 |
|---|---|---|
| Theoretical BF16 / FP8 Compute | 432 TFLOPS (BF16) / 865 TFLOPS (FP8) | ~1,678 TFLOPS (BF16) / ~1,835 TFLOPS (FP8) |
| Number of MMEs | 2 | 8 (spread across two dies) |
| Number of TPCs | 24 | 64 |
| On-die SRAM | 48 MB | 96 MB (48 MB per chiplet) |
| HBM Memory | 96 GB HBM2e, ≈2.45 TB/s bandwidth | 128 GB HBM2e, ≈3.7 TB/s bandwidth |
| Network / I/O | 24 × 100 Gb RoCE, PCIe Gen4 | 24 × 200 Gb RoCE, PCIe Gen5 |
Gaudi-3 nearly quadruples BF16 compute over Gaudi-2, increases memory bandwidth by roughly 1.5×, doubles network I/O, and grows the TPC count from 24 to 64. Interestingly, FP8 throughput increases only by about 2×, reflecting a design choice that prioritizes die-area efficiency.
Comparison with NVIDIA Counterparts

| Metric | Intel Gaudi-2 | Intel Gaudi-3 | NVIDIA A100 | NVIDIA H100 |
|---|---|---|---|---|
| Theoretical BF16 / FP8 Compute | ~432 TFLOPS BF16 / ~865 TFLOPS FP8 | ~1,678 TFLOPS BF16 / ~1,835 TFLOPS FP8 | ~312 TFLOPS FP16/BF16, no FP8 support | Up to ~990 TFLOPS BF16 / ~1,979 TFLOPS FP8 |
| On-die SRAM Capacity | 48 MB | 96 MB | ~40 MB L2 cache (not directly equivalent) | ~50 MB L2 cache (not directly equivalent) |
| HBM Memory Capacity | 96 GB HBM2e | 128 GB HBM2e | ~80 GB HBM2e | 80 GB HBM3 |
| HBM Memory Bandwidth | ~2.45 TB/s | ~3.7 TB/s | ~2.0 TB/s | ~3.35 TB/s |
| TDP (Thermal Design Power) | 600 W | ~900 W | Up to ~400 W | Up to ~700 W |
Notes / caveats:
  • On-die SRAM for NVIDIA chips is given as “L2 cache” values rather than a directly comparable “on-die SRAM” metric like the Gaudi cards. Use with caution.
  • TDP values vary by form factor (air-cooled vs liquid-cooled), vendor board design, and server chassis. The numbers shown are typical published maxima for the accelerator cards.
  • Memory bandwidth/capacity numbers may differ depending on board version (e.g., SXM vs PCIe versions for NVIDIA).
Overall, Intel’s Gaudi series offers larger HBM capacity at a comparable or lower cost, while delivering similar or higher theoretical compute throughput. On paper, Gaudi-3 should compete head-to-head with H100. Whether it achieves this in practice depends entirely on how efficiently it converts that potential into sustained throughput.

Matrix Multiplication Efficiency

General Matrix Multiplication (GEMM) remains the cornerstone of AI acceleration - it drives the bulk of compute in transformer blocks and dense layers. To better understand how each architecture utilizes its compute resources under different workloads, we benchmarked the compute unit utilization across four representative GEMM configurations.
Specifically, we swept FLOPs for the following cases:
  1. M = N = K sweep — balanced square matrices
  2. M = N sweep, K = 256 — small inner dimension (light compute, memory-heavy)
  3. M sweep, N = K = 8192 — large batch dimension
  4. K sweep, M = N = 256 — small outer dimensions, emphasizing accumulation bandwidth
Environment setup was as follows for all benchmarks in this blog:
  • NVIDIA GPUs (A100, H100): CUDA v12.8.9, cuDNN 9.10.2, NCCL 2.25.1, PyTorch 2.8.0+cu128
  • Gaudi-2: Intel Gaudi Software (SynapseAI) v1.22.1
  • Gaudi-3: Intel Gaudi Software (SynapseAI) v1.21.4
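For concreteness, the sketch below shows how a utilization number of this kind can be measured with plain PyTorch. It is a minimal sketch rather than our exact harness: the peak-TFLOPS constant and matrix sizes are assumptions to adjust per device, and on Gaudi the analogous loop uses device="hpu" (after importing habana_frameworks.torch) with torch.hpu.synchronize() in place of the CUDA calls.

```python
import time
import torch

def gemm_tflops(m, n, k, dtype=torch.bfloat16, device="cuda",
                warmup=10, iters=50):
    """Time an (m, k) x (k, n) GEMM and return achieved TFLOPS."""
    a = torch.randn(m, k, device=device, dtype=dtype)
    b = torch.randn(k, n, device=device, dtype=dtype)
    for _ in range(warmup):          # settle clocks, caches, autotuning
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()         # wait for all queued kernels to finish
    elapsed = (time.perf_counter() - start) / iters
    return 2 * m * n * k / elapsed / 1e12   # a GEMM does 2*M*N*K FLOPs

# Case 1 sweep (M = N = K); the peak is an assumed per-device constant,
# e.g. ~990 TFLOPS for H100 BF16.
PEAK_TFLOPS = 990
for s in (1024, 2048, 4096, 8192):
    t = gemm_tflops(s, s, s)
    print(f"M=N=K={s}: {t:7.1f} TFLOPS ({t / PEAK_TFLOPS:.1%} of peak)")
```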
The plots below summarize the results for Intel Gaudi-2, Intel Gaudi-3, NVIDIA A100, and NVIDIA H100 across BF16 and FP8 precision. Together, these four cases illustrate how each device’s compute and memory pipelines behave across varying aspect ratios - from balanced to bandwidth-limited workloads - and reveal key architectural tendencies in both generations of Intel Gaudi accelerators.
Figure 1. Compute unit utilization of matrix multiplications with different shapes

Interpreting the Results

Across the four GEMM benchmark cases in Figure 1, the results reveal distinct behavioral patterns among the accelerators.
  • Gaudi-2: Surprisingly Strong and Consistent
    • Gaudi-2 consistently reaches near-perfect utilization (~99%) for both BF16 and FP8 in cases 1 and 3, sustaining almost all of its theoretical throughput even at large matrix sizes. This level of stability and efficiency is unusual and points to a remarkably well-designed compiler and memory pipeline. It not only rivals but often surpasses NVIDIA A100 in both precision modes, proving that the architecture can fully saturate its compute units under diverse workloads.
  • A100, H100, and Gaudi-3 (FP8): A Competitive Cluster
    • NVIDIA A100 maintains steady BF16 performance at around ~88% utilization in cases 1 and 3, which remains impressive given its broader runtime flexibility. The H100 shows similar behavior, reaching roughly 75–85% across precisions. Gaudi-3 in FP8 mode falls into the same range, delivering ~70–75% utilization, similar to but slightly lower than H100 in most configurations. Overall, these three devices form a comparable performance tier for practical GEMM workloads.
  • Gaudi-3 (BF16): The Outlier
    • While the FP8 path of Gaudi-3 maintains consistent throughput over time, the BF16 results on Gaudi-3 diverge sharply from expectation. Utilization climbs normally at smaller matrix shapes but then drops noticeably as matrices grow larger, rather than plateauing near peak. This regression cannot be attributed to memory or dataflow limits, as FP8 runs on the same hardware remain smooth. It strongly suggests a thermal or power-throttling behavior specific to high-draw BF16 compute. Once the device approaches its power envelope, the clock frequency falls, reducing sustained throughput even though compute units remain active.
In summary, Gaudi-2 performs exceptionally well and often exceeds expectations, A100/H100/Gaudi-3 FP8 deliver comparable steady performance, and Gaudi-3 BF16 shows an atypical decline under heavy load, highlighting the possible design limits in chiplet-based architectures.

Diagnosing the BF16 Drop: Most Likely Thermal and Power Throttling

To further investigate the unexpected BF16 performance drop of Gaudi-3, we benchmarked large GEMM workloads while recording hardware telemetry including core temperature, clock frequency, power draw, and utilization using hl-smi. The results suggest that the performance degradation is most likely linked to thermal and power-management behavior.
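A polling loop along the following lines is enough to reproduce such traces. It is a sketch rather than our exact tooling: the --query-aip field list mirrors hl-smi's nvidia-smi-style query interface, but the supported fields (clock frequency in particular) vary across SynapseAI releases, so verify the names with `hl-smi --help` before relying on them.

```python
import csv
import subprocess
import time

# Field list mirrors hl-smi's nvidia-smi-style query interface; supported
# fields vary by SynapseAI release, so verify with `hl-smi --help`.
FIELDS = "timestamp,temperature.aip,power.draw,utilization.aip,memory.used"

def poll_hl_smi(out_path="telemetry.csv", interval_s=1.0, duration_s=3600):
    """Append one CSV row per device per interval while a benchmark runs."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS.split(","))
        deadline = time.time() + duration_s
        while time.time() < deadline:
            out = subprocess.run(
                ["hl-smi", f"--query-aip={FIELDS}", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            ).stdout
            for line in out.strip().splitlines():   # one line per device
                writer.writerow([c.strip() for c in line.split(",")])
            time.sleep(interval_s)

if __name__ == "__main__":
    poll_hl_smi(duration_s=60)  # run alongside the GEMM benchmark
```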
Figure 2. Hardware telemetry of Gaudi-3 over a sustained matrix multiplication benchmark (M=N=4K, K=8K)
When running the M=N=4K, K=8K GEMM, the device initially sustained stable throughput but exhibited a clear down-clock event after roughly 2,000 seconds, coinciding with a plateau in power draw and a sustained temperature near 75°C.
Figure 3. A zoomed-in view of the down-clock event in Figure 2
A zoomed-in view of this region reveals an inverse relationship between temperature and frequency: as temperature rises, the clock rate and TFLOPS output fall nearly in lockstep.
Interestingly, the reported utilization increases during this period, which may seem counterintuitive at first. To understand why, it’s important to clarify what the utilization metric from the System Management Interface (SMI) actually represents.
According to Intel’s documentation, hl-smi reports the percentage of time within a one-second window during which the device is actively executing kernels. Note that it does not reflect TFLOPS utilization or the fraction of active compute units - for example, a kernel that runs continuously on a single core for one second would report 100% utilization, while a kernel that fully occupies all cores for only half a second would show 50%.
In our case, each iteration performs the same matrix multiplication, so the number of cycles per kernel remains almost the same. When the clock frequency drops, those cycles take longer to complete in wall-clock time. As a result, the device spends more time in the “executing” state, causing hl-smi to report higher utilization even though the actual computational throughput has decreased.
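A toy model makes the effect concrete. Assuming a fixed cycle count per kernel and a fixed host-side launch rate (both illustrative numbers, not measurements), a lower clock stretches each kernel's wall-clock time and pushes the reported busy fraction up even as delivered TFLOPS falls:

```python
# Toy model of hl-smi's time-based utilization under a down-clock.
# Both constants are illustrative, not measured values.
cycles_per_kernel = 1.6e9     # fixed work per GEMM launch
launch_rate_hz = 0.9          # kernels enqueued per second by the host

for clock_hz in (1.6e9, 1.2e9):          # before vs. after throttling
    kernel_s = cycles_per_kernel / clock_hz
    busy_fraction = min(1.0, kernel_s * launch_rate_hz)
    print(f"{clock_hz / 1e9:.1f} GHz: {kernel_s:.2f} s per kernel, "
          f"reported utilization ~{busy_fraction:.0%}")

# 1.6 GHz -> 1.00 s per kernel, utilization ~90%
# 1.2 GHz -> 1.33 s per kernel, utilization ~100%, even though delivered
#            TFLOPS fell by 25% along with the clock.
```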
Figure 4. Hardware telemetry of Gaudi-3 over a sustained matrix multiplication benchmark (M=N=K=8K), showing immediate down-clock
For the larger M=N=K=8K configuration, the behavior was even more immediate. The device down-clocked within about 30 seconds of starting the benchmark, with TFLOPS dropping accordingly while temperature quickly stabilized near the same upper threshold. In both cases, overall utilization remained high, suggesting that compute units themselves were active but frequency scaling was limiting sustained throughput.
This pattern points toward the accelerator hitting its thermal or power envelope and invoking automatic protection mechanisms. Once the chip enters this regime, performance no longer tracks workload demand but rather the device’s thermal headroom. The fact that FP8 workloads do not exhibit this degradation supports the interpretation: FP8 requires less energy per FLOP and thus generates less heat at identical throughput, which fits Gaudi-3’s FP8-oriented design.
In summary, while we cannot definitively isolate a single root cause, the telemetry aligns closely with temperature-driven frequency scaling or package-level power capping. This behavior is consistent with known challenges in chiplet-based designs, where uneven thermal distribution between dies can trigger early throttling. Further investigation, such as monitoring per-die temperature sensors or repeating the tests under liquid cooling, would help confirm whether the issue stems from localized thermal hotspots or global power-management limits.

Attention / Memory Bandwidth Efficiency

While GEMM defines compute-bound performance, attention layers are fundamentally memory-bound. The efficiency of transformer inference depends not only on raw compute throughput but also on how effectively the accelerator can stream large Key/Value (KV) tensors from memory. This section evaluates how Intel Gaudi and NVIDIA GPUs utilize their memory bandwidth in both contiguous and dynamic attention workloads.

Scaled Dot-Product Attention (SDPA)

We first benchmarked scaled dot-product attention (SDPA). SDPA is the simplest form of attention used in the prefill stage of LLM inference, where all tokens are processed sequentially and memory accesses are fully contiguous.
For each accelerator (Gaudi-2, Gaudi-3, A100, H100), we:
  • Measured the total latency of the SDPA kernel.
  • Counted the bytes read from DRAM during attention.
  • Computed the resulting memory bandwidth utilization as read_bytes / (latency × theoretical_bandwidth).
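The sketch below illustrates the measurement on the CUDA side. It is a minimal sketch, not our exact harness: it counts each of Q, K, and V exactly once as an analytic lower bound on DRAM reads (our reported numbers count actual bytes read), and the head count, head dimension, and peak-bandwidth constant are assumptions to adjust per device.

```python
import time
import torch
import torch.nn.functional as F

def sdpa_bandwidth_util(seq_len, peak_bw_tb_s, n_heads=32, head_dim=128,
                        dtype=torch.bfloat16, device="cuda", iters=10):
    """Achieved read bandwidth of SDPA as a fraction of a theoretical peak.

    Counts each of Q, K, V once -- the analytic lower bound on DRAM reads --
    so it slightly understates real traffic.
    """
    q = torch.randn(1, n_heads, seq_len, head_dim, device=device, dtype=dtype)
    k, v = torch.randn_like(q), torch.randn_like(q)
    F.scaled_dot_product_attention(q, k, v)   # warmup / kernel selection
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - t0) / iters
    read_bytes = 3 * q.numel() * q.element_size()   # Q + K + V
    return read_bytes / (latency * peak_bw_tb_s * 1e12)

for s in (8_192, 32_768, 131_072):
    util = sdpa_bandwidth_util(s, peak_bw_tb_s=3.35)  # H100 peak (assumed)
    print(f"seq={s}: {util:.1%} of theoretical bandwidth")
```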
Figure 5. Memory bandwidth utilization (SDPA, batch size = 1)
To interpret these measurements, we compare the achieved bandwidth utilization across devices as a function of sequence length. Since SDPA is a contiguous-access workload dominated by reads of Key and Value tensors, it serves as a good proxy for how well each accelerator’s memory subsystem sustains long sequential transfers. The resulting utilization curves reveal clear differences in memory scheduling efficiency and thermal stability between architectures.
Findings
  • A100 and H100 achieve the highest bandwidth utilization, peaking at ~85–95%, and quickly plateau across sequence lengths: a sign of well-optimized, stable memory scheduling.
  • Gaudi-2 and Gaudi-3 trail behind but remain competitive up to ~500K tokens, beyond which utilization drops gradually.
  • Gaudi-3 shows a sharper decline at long sequences, which likely correlates with the BF16 throttling observed in the GEMM tests. Since attention relies on BF16 compute internally, thermally constrained frequency scaling may directly reduce the achievable memory throughput.
This result suggests that Gaudi’s SDPA kernels are well-optimized for contiguous data but still limited by power and thermal factors during sustained long-sequence workloads.

Dynamic Attention (PagedAttention)

While SDPA represents the best case for contiguous access, PagedAttention reflects the decode phase of LLMs — the most challenging regime for both hardware and software. In this phase, the model retrieves only relevant KV pages from memory, creating irregular and non-coalesced access patterns that stress memory schedulers.
To benchmark this realistically, we compared:
  • flatPA (Gaudi’s custom implementation of PagedAttention) on Gaudi-2 and Gaudi-3, and
  • FlashAttention v3 (for A100 and H100),
across two batch configurations: BS = 1 and BS = 16.
Figure 6. Memory bandwidth utilization (PagedAttention, batch size = 1 and 16)
Findings
  • NVIDIA GPUs (A100 / H100)
    • FlashAttention v3 sustains ~60–75% utilization, reaching a stable plateau even with very long context lengths.
    • With BS = 16, both A100 and H100 scale efficiently, maintaining ~70% utilization as batch size increases — showing excellent memory coalescing and scheduling efficiency under dynamic workloads.
  • Intel Gaudi-2 / Gaudi-3 (flatPA)
    • Bandwidth utilization peaks around 25–30% for Gaudi-3 and 20–25% for Gaudi-2 at BS = 1, improving slightly with larger batch sizes but remaining below 30% even at BS = 16.
    • The gap versus NVIDIA devices widens significantly as sequences lengthen, suggesting persistent inefficiency in how Gaudi handles dynamic KV-cache paging.

Runtime and Memory System Differences

The performance contrast between Gaudi and NVIDIA GPUs in attention workloads mainly reflects how each platform handles complex memory accesses.
  • Compiler-Driven Execution on Gaudi
    • Gaudi’s runtime is compiler-driven: tensor layouts and memory access patterns are determined during compilation. This static approach enables highly optimized execution for fixed-shape workloads but adds overhead when operations require dynamic data movement, such as rearranging KV cache pages for attention. These rearrangements must follow precompiled memory schedules, which limits flexibility and increases latency for workloads with variable sequence lengths.
  • flatPA: Compute-Assisted Rearrangement
    • Gaudi’s flatPA implementation mitigates this overhead by reformulating memory rearrangement as a matrix multiplication with a one-hot matrix, allowing the compute cores to handle data reordering (see the sketch at the end of this section). This design takes advantage of Gaudi’s strong matrix-math throughput and overlaps some of the rearrangement cost with computation. Still, as sequence lengths increase and KV caches grow, the residual overhead becomes more noticeable, leading to declining effective bandwidth.
  • GPU Memory Flexibility and Kernel Optimization
    • NVIDIA GPUs, in contrast, can issue memory transactions to arbitrary addresses at runtime, enabling them to handle scattered KV-cache access patterns without prior layout assumptions. This architectural flexibility allows high-performance kernels such as FlashAttention to fully exploit GPU hardware characteristics: fine-grained memory access control, high concurrency in memory pipelines, and fast on-chip buffering. These software-hardware co-optimizations translate directly into higher sustained bandwidth utilization in attention workloads.
In short, Gaudi’s attention performance is constrained less by compute capability and more by the rigidity of its precompiled memory model. The flatPA kernel effectively leverages Gaudi’s compute power to compensate, but the system still trails GPUs in workloads that rely on flexible, dynamic memory scheduling. As Gaudi’s compiler and kernel ecosystem continue to evolve, improving this dynamic adaptability will be key to narrowing the performance gap in attention-intensive inference.
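To make the flatPA idea concrete, the toy example below (our illustration, not Gaudi's kernel code) shows how a scattered page gather can be recast as a GEMM with a one-hot selection matrix, turning a dynamic memory operation into a statically schedulable matmul:

```python
import torch

# Toy illustration: gathering scattered KV-cache pages expressed as a GEMM
# with a one-hot selection matrix (sizes are arbitrary demo values).
num_pages, page_size, head_dim = 256, 128, 128
kv_cache = torch.randn(num_pages * page_size, head_dim)

# Pages owned by one request, in logical order (hypothetical page table).
page_table = torch.tensor([17, 3, 250, 42])
token_idx = (page_table[:, None] * page_size
             + torch.arange(page_size)[None, :]).reshape(-1)

# Path 1: dynamic gather -- what a GPU issues directly at runtime.
gathered = kv_cache[token_idx]

# Path 2: one-hot GEMM -- a fixed-shape matmul that a static compiler can
# schedule on the MMEs, at the cost of redundant multiply-accumulates.
one_hot = torch.zeros(token_idx.numel(), kv_cache.shape[0])
one_hot[torch.arange(token_idx.numel()), token_idx] = 1.0
gathered_via_gemm = one_hot @ kv_cache

assert torch.allclose(gathered, gathered_via_gemm)
```

The one-hot matrix is almost entirely zeros, which is exactly the trade-off: cheap MME throughput is spent to avoid dynamic addressing, and that residual cost grows with the KV cache, consistent with the declining bandwidth we observe at long sequences.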

LLM Serving Benchmark: Where Gaudi Shines and Struggles

After examining GEMM and attention microbenchmarks, a natural question arises: If Gaudi trails in attention performance, can it still compete in real-world LLM inference?
The answer depends heavily on model size, precision, and memory behavior. While Gaudi lags in kernel-level efficiency, its larger on-board memory often turns into a decisive advantage in end-to-end workloads.

The Hidden Advantage: Memory Capacity

One of Gaudi’s defining strengths lies in memory capacity.
  • Gaudi-2 ships with 96 GB of HBM2e.
  • Gaudi-3 expands this to 128 GB.
  • By contrast, NVIDIA’s A100 and H100 both provide 80 GB.
In LLM serving, where KV caches for multi-billion-parameter models can exceed tens of GBs per instance, this additional memory translates directly into larger active batch sizes and longer feasible context windows without offloading. The difference becomes particularly pronounced for models above 30B parameters served on a single card.
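The arithmetic is easy to sketch. The configuration values below are assumed for a Qwen3-32B-like GQA model (64 layers, 8 KV heads of dimension 128); check the model's config.json before relying on them:

```python
# Back-of-the-envelope KV-cache sizing. Config values are assumed for a
# Qwen3-32B-like GQA model; verify against the model's config.json.
n_layers, n_kv_heads, head_dim = 64, 8, 128
bytes_per_elem = 2                              # BF16

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(f"{kv_bytes_per_token / 2**10:.0f} KiB per token")        # 256 KiB

ctx_len, active_batch = 8192, 16
kv_total = kv_bytes_per_token * ctx_len * active_batch
print(f"{kv_total / 2**30:.0f} GiB of KV cache")                # 32 GiB
# Plus ~64 GB of BF16 weights (32B params x 2 bytes) -- together already
# more than an 80 GB card can hold without shrinking the batch.
```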
To test this, we benchmarked Qwen3-32B, a 32-billion-parameter LLM well-suited for single-card inference, across both Intel and NVIDIA accelerators.

Benchmark Setup

We measured end-to-end LLM serving performance using two evaluation modes:
  1. Fixed-length benchmark — input sequences of 1K, 2K, 4K, 8K tokens, with 1K output tokens generated while ignoring the end-of-sequence (EOS) token.
  2. Dynamic-length benchmark — random input lengths up to 1K, 2K, 4K, 8K tokens, and up to 1K output tokens, using the dynamic_sonnet_llama3 dataset.
For both settings, we varied user concurrency, plotting TPOT (time-per-output-token) versus throughput to capture the trade-off between user experience and serving cost. Each system used an equivalent number of accelerators and identical model weights and precisions.
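For clarity, the snippet below shows how these two metrics can be derived from streamed responses. It is a minimal sketch; the request fields are illustrative names, not a benchmark-tool API:

```python
import statistics

def tpot_and_throughput(requests, wall_time_s):
    """Mean TPOT and aggregate throughput from streaming timestamps.

    Each request dict is assumed to hold t_first_token and t_last_token
    (seconds) plus n_output_tokens; field names are illustrative.
    """
    tpots = [
        (r["t_last_token"] - r["t_first_token"]) / (r["n_output_tokens"] - 1)
        for r in requests if r["n_output_tokens"] > 1
    ]
    total_tokens = sum(r["n_output_tokens"] for r in requests)
    return statistics.mean(tpots), total_tokens / wall_time_s

# Example with made-up numbers: two streamed responses over a 10 s window.
reqs = [
    {"t_first_token": 0.4, "t_last_token": 8.6, "n_output_tokens": 512},
    {"t_first_token": 0.9, "t_last_token": 9.3, "n_output_tokens": 400},
]
mean_tpot, tput = tpot_and_throughput(reqs, wall_time_s=10.0)
print(f"mean TPOT {mean_tpot * 1e3:.1f} ms/token, throughput {tput:.0f} tok/s")
```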

Gaudi-2 vs A100: A Clear Win

Figure 7. TPOT vs Throughput plot of fixed dataset benchmark on Gaudi-2 (TP=2) and A100 (TP=2).
Figure 8. TPOT vs Throughput plot of dynamic dataset benchmark on Gaudi-2 (TP=2) and A100 (TP=2).
With two devices each (Gaudi-2 and A100 both TP = 2), Gaudi-2 consistently outperformed A100 across all sequence lengths and benchmark modes.
This advantage stems from two main factors:
  1. Larger memory capacity – Gaudi-2’s 96 GB avoids KV-cache offloading and fragmentation, maintaining consistent latency even for 8K contexts.
  2. Native FP8 acceleration – Unlike A100, Gaudi-2 supports FP8 compute in hardware, allowing greater parallelism and lower memory-bandwidth pressure than BF16 or FP16.
Across both fixed and dynamic tests, Gaudi-2 achieved higher throughput at lower TPOT values, indicating superior efficiency.

Gaudi-3 vs H100: A Context-Dependent Story

The newer generation comparison paints a more nuanced picture.

BF16 Precision

Figure 9. TPOT vs Throughput plot of fixed dataset benchmark on Gaudi-3 and H100 (BF16).
Figure 10. TPOT vs Throughput plot of dynamic dataset benchmark on Gaudi-3 and H100 (BF16).
On BF16, Gaudi-3 outperformed H100 in throughput for the fixed benchmark.
This advantage mainly stems from H100’s limited memory capacity: with 80 GB of HBM3, roughly 64 GB is consumed by Qwen3-32B weights alone, leaving less than 16 GB for KV caches and activations. This restricts the active batch size that can fit entirely in memory, reducing achievable throughput. The effect is visible in the benchmark figures: as input sequences grow longer, multiple data points begin to overlap, indicating that the effective batch size is pinned by the memory ceiling. In contrast, Gaudi-3’s 128 GB of HBM provides enough headroom to hold both the model weights and large KV caches simultaneously, supporting higher concurrency and longer context windows without performance collapse.
In the dynamic benchmark, H100 benefits from shorter average sequence lengths, which reduce KV-cache usage and allow it to run with a slightly larger active batch size. This enables H100 to extend performance into longer sequences compared with the fixed benchmark. However, its memory ceiling still constrains scalability, meaning Gaudi-3 continues to achieve higher peak throughput, particularly at large contexts or higher user concurrency.

FP8 Precision

Figure 11. TPOT vs Throughput plot of fixed dataset benchmark on Gaudi-3 and H100 (FP8).
Figure 12. TPOT vs Throughput plot of dynamic dataset benchmark on Gaudi-3 and H100 (FP8).
With FP8 precision, the balance shifts in a more nuanced way. The Qwen3-32B model footprint drops to around 32 GB, significantly easing memory pressure for both Gaudi-3 and H100. This change allows both devices to operate with higher active batch sizes, and the performance differences start reflecting the interaction between compute efficiency and memory scaling, rather than sheer capacity limits.
In the fixed-length benchmarks, Gaudi-3 once again led in overall throughput. Its 128 GB of HBM provided enough headroom to accommodate not just the full model weights, but also sizable KV caches for long sequences, enabling larger concurrent batch sizes and sustained performance even at 8K contexts. H100, while more efficient in raw compute, hit the edge of its usable memory range at higher sequence lengths, capping its achievable throughput earlier. Here, memory capacity played a decisive role - Gaudi-3 maintained higher throughput without cache offloading.
In the dynamic-length benchmarks, however, the advantage shifted to H100. With shorter and variable input sequences, KV-cache utilization was lower on average, meaning both accelerators could fit similar batch sizes into available memory. Under these conditions, H100’s stronger attention kernel efficiency, better thermal stability, and higher sustained clock frequencies became more influential. As a result, H100 outperformed Gaudi-3 across most dynamic workloads, achieving smoother scaling and lower TPOT at comparable throughput levels.

Takeaways

Across all benchmarks, several consistent patterns emerge.
  • Gaudi-2 demonstrated clear and repeatable advantages over A100.
    • Its larger 96 GB HBM and native FP8 acceleration enabled higher throughput and stable latency across both fixed and dynamic workloads. In every configuration tested, Gaudi-2 sustained larger active batches and delivered superior efficiency per accelerator.
  • Gaudi-3 showed more mixed results against H100.
    • In memory-bound scenarios, particularly with large context lengths or fixed sequence inputs, Gaudi-3’s 128 GB HBM allowed higher concurrency and higher peak throughput. However, when workloads became more dynamic and average sequence lengths shortened, H100 regained the lead, benefiting from its stronger attention kernels and higher sustained frequency.
Overall, these results indicate that Gaudi-based accelerators can excel when inference is constrained by memory, such as in long-context or high-concurrency deployments.

Conclusion

Our analysis shows that Intel’s Gaudi series offers a more nuanced story than raw specifications might suggest.
Gaudi-2 demonstrates exceptional maturity — it delivers near-peak GEMM efficiency, stable attention performance, and tangible cost and memory advantages over NVIDIA’s A100. Its native FP8 path, generous 96 GB HBM, and reliable thermal envelope make it one of the most practical and cost-effective accelerators available today for GenAI inference and even mid-scale training workloads. In direct comparison, Gaudi-2 not only matches A100’s performance but surpasses it across both fixed and dynamic LLM serving scenarios.
Gaudi-3, however, paints a more complex picture. On paper, it competes directly with H100: higher compute, more memory, and stronger I/O. In practice, its performance depends heavily on precision mode and workload structure. Namely, Gaudi-3 can outperform H100 in long-context, memory-heavy inference tasks - where its 128 GB of HBM per card allows larger batch sizes and avoids offloading.
Yet, the key takeaway is that Gaudi-3 remains viable - not as a universal replacement for H100, but as a specialized alternative when memory capacity, long-context models, or FP8-based serving pipelines with fewer cards are the priority. Its combination of large memory, strong compute density, and competitive pricing makes it particularly attractive for large-context LLM inference and cost-sensitive deployment environments.
As driver updates, compiler optimizations, and firmware refinements continue, we expect Gaudi-3’s efficiency to close the gap further. For teams building large-scale inference stacks, Intel’s Gaudi line deserves a serious look - not just as a cheaper alternative, but as a strategically differentiated platform for high-memory, FP8-optimized AI serving.
 