Guided Decoding Performance on vLLM and SGLang
The guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the optimal setup for generating structured output based on your use case.
Sep 16, 2025
Introduction
Large Language Models (LLMs) generate text that is inherently probabilistic and unstructured. This makes them powerful for creative tasks, but it poses challenges when reliable, formatted outputs are required, such as ensuring consistency in API calls, database interactions, or tool-calling. For instance, an LLM tasked with generating a JSON object from user data might prepend unnecessary explanations or deviate from the expected schema, leading to parsing errors and unreliable applications. As LLMs evolve beyond simple text generation to act as agents that autonomously call APIs or interface with external systems, user-defined structured outputs have become critical for maintaining stability and predictability.
Guided decoding (also known as structured output or constrained decoding) addresses this by constraining generation to formats like JSON, XML, or regex-defined strings. Modern serving frameworks like vLLM and SGLang use powerful grammar backends to enforce these rules, with two prominent players being XGrammar and LLGuidance.
However, this control introduces computational overhead. The choice of grammar backend and the serving framework it runs on has a major impact on performance and reliability. To help you choose the optimal setup, we benchmarked the two leading grammar backends, XGrammar and LLGuidance, on the two most popular serving frameworks, vLLM and SGLang.
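To make the setup concrete, here is a minimal sketch of how a client might request schema-constrained output from an OpenAI-compatible vLLM endpoint. The server URL, model name, and prompt are placeholders; `guided_json` is vLLM's extra-body field for JSON Schema constraints, and SGLang exposes a similar option. Treat this as an illustrative sketch rather than the exact setup used in the benchmarks below.

```python
# Minimal sketch: requesting schema-constrained JSON from an OpenAI-compatible
# vLLM server. Endpoint, model, and prompt are placeholders; "guided_json" is
# vLLM's extra-body field for JSON Schema constraints (SGLang offers a similar
# structured-output option).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["name", "email"],
}

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Extract the contact info: 'Reach Ada at ada@example.com'."}],
    extra_body={"guided_json": schema},  # constrain the output to the schema
)
print(response.choices[0].message.content)  # parses as the requested JSON object
```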
How Guided Decoding Works

Guided decoding works by restricting the model's next-token choices to only those that are grammatically valid given the context so far. Instead of sampling freely from the entire vocabulary, the model is constrained to generate tokens that comply with a user-defined structure. To achieve this, the standard LLM inference pipeline is augmented with the following steps:
- Create Grammar from Schema: The user-defined schema (e.g., JSON Schema, regex) is compiled into a grammar representation (typically a Deterministic Finite Automaton) that defines the valid token sequences.
- Generate Token Mask: At each generation step, the grammar generates a token mask that filters out invalid candidates based on the previous context.
- Apply Token Bitmask to LLM Logits: The mask is applied to the LLM's output logits, forcing the model to sample only from the set of grammatically correct tokens.
This process guarantees well-structured outputs but introduces computational overhead from grammar construction and per-step mask generation. Managing this overhead efficiently is crucial to avoid degrading inference performance. XGrammar and LLGuidance address this challenge with different strategies, which we evaluate in the next section.
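As a concrete illustration of step 3, the sketch below shows in PyTorch how a token bitmask forces sampling to stay in-grammar. The grammar backend is abstracted away here; `allowed` stands for whatever boolean mask the backend produced for the current decoding step.

```python
# Toy sketch of step 3: applying a token bitmask to the next-token logits.
import torch

def apply_token_mask(logits: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Set logits of grammar-invalid tokens to -inf so they can never be sampled.

    logits:  (vocab_size,) raw next-token scores from the LLM
    allowed: (vocab_size,) boolean mask, True where the grammar permits the token
    """
    return logits.masked_fill(~allowed, float("-inf"))

# Example: a 6-token vocabulary where only tokens 1 and 4 are valid next.
logits = torch.randn(6)
allowed = torch.tensor([False, True, False, False, True, False])
probs = torch.softmax(apply_token_mask(logits, allowed), dim=-1)
# probs is non-zero only at positions 1 and 4, so sampling stays in-grammar.
```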
XGrammar
Figure 2. The overall workflow in XGrammar. [ref]
XGrammar is designed to minimize runtime overhead through pre-computation. As shown in Figure 2, XGrammar partitions the LLM's vocabulary at each automaton state into two groups: context-independent and context-dependent tokens. During grammar creation, XGrammar pre-computes masks for context-independent tokens, leaving only context-dependent tokens to be validated at generation time. This design enables fast mask generation for simple schemas with relatively few context-dependent tokens. Furthermore, by caching the created grammar, it reduces grammar creation cost for repeated schemas.
However, the pre-computation step itself can be time-consuming. Additionally, for complex schemas with many context-dependent tokens, performance may degrade because a large portion of the mask must still be validated during generation.
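For a feel of how this looks in practice, here is a minimal sketch of the compile-once, reuse-many pattern that benefits repetitive schemas, based on XGrammar's Python API. Exact names and signatures may differ between versions, so treat it as illustrative rather than exact.

```python
# Minimal sketch of XGrammar's compile-once / reuse-many pattern.
# Names follow XGrammar's documented Python API but may vary by version.
import json
import torch
import xgrammar as xgr
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
tok_info = xgr.TokenizerInfo.from_huggingface(tokenizer)
compiler = xgr.GrammarCompiler(tok_info)

schema = json.dumps({"type": "object", "properties": {"isbn": {"type": "string"}}})
compiled = compiler.compile_json_schema(schema)  # pre-computation happens once here

# Every request that reuses this schema skips recompilation; at decode time
# only context-dependent tokens still need to be checked.
matcher = xgr.GrammarMatcher(compiled)
bitmask = xgr.allocate_token_bitmask(1, tok_info.vocab_size)
matcher.fill_next_token_bitmask(bitmask)            # step 2: generate the token mask
logits = torch.randn(1, tok_info.vocab_size)
xgr.apply_token_bitmask_inplace(logits, bitmask)    # step 3: mask invalid tokens
```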
LLGuidance
LLGuidance takes a different approach by generating token masks dynamically at each decoding step. To avoid the upfront cost of full grammar compilation, it builds its automaton lazily. Then, at each step, it efficiently generates the token mask by traversing a pre-built prefix tree (trie) of the LLM's vocabulary. This approach enables fast mask generation even when processing complex or entirely new schemas for the first time, making it well-suited for scenarios where flexibility is more critical than minimizing per-schema initialization time.
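To illustrate the idea, the toy sketch below (a conceptual illustration, not LLGuidance's actual implementation) builds a mask by walking a character-level prefix tree of the vocabulary and pruning whole subtrees as soon as the grammar rejects a prefix, so most tokens are never inspected individually.

```python
# Toy illustration of trie-based mask generation (not LLGuidance's real code).
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)   # char -> TrieNode
    token_ids: list = field(default_factory=list)  # token ids whose text ends here

def build_trie(vocab):
    """vocab: {token_id: token_text} -> root of a character-level trie."""
    root = TrieNode()
    for tid, text in vocab.items():
        node = root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(tid)
    return root

def allowed_tokens(node, can_extend, prefix=""):
    """Collect token ids whose text, appended to the output so far, stays valid.

    `can_extend(s)` is the grammar oracle: True if the text generated so far
    plus `s` can still be completed into a valid output. Rejected prefixes
    prune entire subtrees, so most of the vocabulary is never visited.
    """
    allowed = list(node.token_ids) if prefix else []
    for ch, child in node.children.items():
        if can_extend(prefix + ch):
            allowed.extend(allowed_tokens(child, can_extend, prefix + ch))
    return allowed

# Toy usage: only tokens made entirely of digits are allowed next.
vocab = {0: "ab", 1: "1", 2: "12", 3: "a1", 4: "3"}
trie = build_trie(vocab)
print(allowed_tokens(trie, can_extend=str.isdigit))  # -> [1, 2, 4]
```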
The Decisive Factor: Serving Framework Integration

The performance of guided decoding depends not only on the grammar backend itself but also on how grammar-related processes are integrated into the serving pipeline. The most straightforward implementation executes these steps sequentially, as illustrated in Figure 3(a).
However, grammar-related processing is a CPU-intensive task, while LLM inference is a GPU-intensive task. This makes parallelization a natural optimization opportunity. For example, vLLM overlaps the initial grammar creation with the GPU's execution of other requests (Figure 3(b)). SGLang takes this a step further by also overlapping the per-step mask generation with the LLM inference step (Figure 3(c)), more effectively hiding the latency of the grammar processing.
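To show why this overlap matters, the runnable toy below (a simplified illustration, not the actual vLLM or SGLang scheduler) simulates per-step costs for the GPU forward pass and the CPU-side mask generation: when the mask for a step is computed while the forward pass runs, most of its latency disappears from the end-to-end time.

```python
# Simplified toy: overlapping CPU-side mask generation with the GPU forward pass.
import time
from concurrent.futures import ThreadPoolExecutor

GPU_STEP = 0.010   # simulated duration of one decode step on the GPU
MASK_STEP = 0.008  # simulated duration of one CPU-side mask generation

def decode_sequential(n_steps: int) -> float:
    start = time.perf_counter()
    for _ in range(n_steps):
        time.sleep(MASK_STEP)  # generate the mask while the GPU sits idle
        time.sleep(GPU_STEP)   # then run the forward pass
    return time.perf_counter() - start

def decode_overlapped(n_steps: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        mask = pool.submit(time.sleep, MASK_STEP)      # mask for the first step
        for _ in range(n_steps):
            time.sleep(GPU_STEP)                       # GPU forward pass runs now
            mask.result()                              # mask finished in parallel
            mask = pool.submit(time.sleep, MASK_STEP)  # start preparing the next mask
        mask.result()
    return time.perf_counter() - start

print(f"sequential: {decode_sequential(100):.2f} s")  # ~1.8 s
print(f"overlapped: {decode_overlapped(100):.2f} s")  # ~1.0 s, mask cost mostly hidden
```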
Experiment Setup
Hardware & Software Environment
- CPU: Intel(R) Xeon(R) Platinum 8480+
- Memory: 480 GiB
- GPU: NVIDIA H100 80GB HBM3
- Models: Qwen3-8B, Qwen3-32B (TP2); reasoning capabilities were disabled for both models (one way to do this is sketched below)
- Frameworks: vLLM v0.10.0, SGLang 0.5.0rc0, xgrammar 0.1.21, llguidance 0.7.30
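The post does not show how reasoning was disabled. One common way to do it for Qwen3 on an OpenAI-compatible vLLM or SGLang server is to pass `enable_thinking: false` through the chat template, as sketched below; this is an assumption about the setup, not the authors' documented configuration.

```python
# One common way to disable Qwen3's "thinking" mode on an OpenAI-compatible
# vLLM/SGLang server. This is an assumption, not the authors' documented setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Return book metadata as JSON."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # no <think> block
)
print(response.choices[0].message.content)
```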
Datasets
To evaluate performance across different conditions, we used a mix of benchmark and custom datasets:
- Schema Diversity & Robustness: We adopted JSONSchemaBench, which contains a wide variety of real-world JSON schemas. This dataset allowed us to stress-test grammar creation under diverse schema structures.
- Efficiency Benchmarks: We designed two distinct scenarios to capture different serving conditions:
    - Repetitive Schema Scenario: A custom "Book-Info" task, where 1,000 requests generate book metadata (author, publisher, ISBN, etc.) from a given title, all using the same simple schema. This setup highlights the impact of grammar caching and repeated schema usage. (An illustrative schema of this kind is sketched after this list.)
    - Dynamic Schema Scenario: Using the Github_easy and Github_medium subsets of JSONSchemaBench, where each request uses a unique schema. This simulates workloads with frequent schema changes.
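For illustration, a "Book-Info"-style schema might look like the sketch below. This is a hypothetical example based on the fields listed above; the exact schema used in the benchmark is not shown in the post.

```python
# Hypothetical example of a simple, fixed "Book-Info"-style JSON schema shared
# by every request in the repetitive scenario. The actual benchmark schema may differ.
book_info_schema = {
    "type": "object",
    "properties": {
        "title":     {"type": "string"},
        "author":    {"type": "string"},
        "publisher": {"type": "string"},
        "isbn":      {"type": "string", "pattern": "^[0-9-]{10,17}$"},
    },
    "required": ["title", "author", "publisher", "isbn"],
    "additionalProperties": False,
}
```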
Experiment Results
Schema Diversity & Robustness
To evaluate grammar creation robustness and schema coverage, we used the JSONSchemaBench dataset. This experiment measures how many schemas each framework could successfully compile, while categorizing failures into three types: compilation failures, timeouts (10 seconds), and rejections by the serving framework.
The results in Table 1 highlight a key trade-off in robustness. LLGuidance excels in speed, with zero timeouts, but struggles with a higher number of compilation failures. Conversely, while XGrammar handles more schemas initially, it experiences timeouts on complex ones. Furthermore, its practical utility is significantly limited by the additional schemas rejected by the vLLM integration filter, as shown in the "vLLM filter" column.
Therefore, for the subsequent performance benchmarks, we only consider the schemas in the "Pass ALL" column, representing the set that both frameworks successfully processed.
Efficiency Benchmark
Performance on Repetitive Schemas
We first evaluate performance on a simple, repetitive task using the Book-Info dataset, where every request shares the same schema. This experiment compares the vLLM and SGLang serving frameworks under three conditions: without guided decoding (labeled "baseline"), with XGrammar, and with LLGuidance. Performance is measured using Output Throughput (tokens/sec) and Mean Time Per Output Token (TPOT), and we also verify whether the generated output conforms to the schema (correct rate).
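These metrics follow their standard definitions; the sketch below shows how they can be computed from per-request timings. This is a generic illustration, not the benchmark's actual measurement harness.

```python
# Standard definitions of the two efficiency metrics, computed from timings.
def output_throughput(total_output_tokens: int, wall_time_s: float) -> float:
    """Output tokens generated per second across the whole benchmark run."""
    return total_output_tokens / wall_time_s

def mean_tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Mean Time Per Output Token for one request: decode time spread over
    the tokens generated after the first one."""
    return (e2e_latency_s - ttft_s) / max(output_tokens - 1, 1)

# Example: a request that took 4.2 s end-to-end, 0.2 s to the first token,
# and produced 201 output tokens has a mean TPOT of 20 ms.
print(f"{mean_tpot(4.2, 0.2, 201) * 1000:.1f} ms")  # 20.0 ms
```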
The results highlight that guided decoding is essential for correctness. As shown in Figure 4, without guided decoding the correct rate never exceeds 72%. In contrast, both guided decoding backends achieve a 100% correct rate.
When comparing the two backends, XGrammar consistently outperforms LLGuidance in both throughput and TPOT across both serving frameworks. This is because the simple, repetitive schema allows XGrammar's pre-computation and caching strategy to shine, resulting in lower overhead. This performance gap becomes more apparent with larger batch sizes. Since LLGuidance generates a new mask for every request, regardless of repetition, the CPU bottleneck becomes progressively worse as concurrency increases.
The results also reveal critical differences between the serving frameworks. vLLM shows a significant performance drop with guided decoding compared to its baseline, especially at a batch size of 8 or greater. Its sequential, non-overlapped mask generation introduces overhead that directly harms performance. In contrast, SGLang's architecture effectively mitigates this cost by overlapping mask generation with the GPU's inference step. This allows it to achieve structured output with minimal performance loss, bringing its guided decoding performance much closer to its baseline.
Performance on Dynamic Schemas
To simulate real-world applications where schemas can change with every request, we benchmarked performance on dynamic schemas. For this, we used the Github_easy and Github_medium datasets, where each request is assigned a unique schema.
Github_easy Results
Although the schemas in Github_easy are relatively simple, their uniqueness tests robustness under dynamic conditions.

Figure 5 shows the benchmark results for each grammar backend running on vLLM and SGLang with the Github_easy dataset. The results first highlight a key point about correctness. Without guided decoding, the rate of outputs satisfying the schema constraints is already high, hovering between 90% and 94%. However, guided decoding provides a consistent improvement, pushing the correct rate to over 96% and as high as 98.2%. While guided decoding improves accuracy, failures can still occur, often due to output degeneration (e.g., unnatural repetition of tokens such as "\n", "\t", or spaces). When errors caused by this degeneration are excluded, the rate of invalid JSON generation was only 2.21% for XGrammar and a mere 0.12% for LLGuidance.
In terms of performance, LLGuidance consistently outperforms XGrammar across both serving frameworks. Because every request introduces a unique schema, XGrammar's caching and pre-computation strategy is neutralized. LLGuidance's dynamic approach is better suited for this scenario, resulting in higher throughput and lower TPOT.
Github_medium Results
Next, we examine a more challenging and realistic scenario using the Github_medium dataset, which contains unique and moderately complex schemas.
Figure 6 shows that on this more complex dataset, the importance of guided decoding becomes even clearer. The correct rate for unconstrained decoding drops significantly, falling to as low as 61.1%. Guided decoding provides a substantial boost to correctness, improving the rate by 20-25 percentage points in most cases.

Figure 7. Generation throughput over time on the Github_medium dataset for Qwen3-32B-TP2 on vLLM with a max concurrency of 64.

Performance patterns diverge more clearly in Figure 7, which plots generation throughput over time on vLLM with Qwen3-32B-TP2 at a max concurrency of 64. Throughput is initially low due to prefilling but rises as batches shift to decoding. In this stage, unconstrained decoding (labeled "only" in the figure) maintains the highest throughput. LLGuidance, while slower than the baseline, sustains stable throughput. XGrammar, in contrast, shows erratic behavior with frequent sharp drops. These stalls indicate severe CPU bottlenecks during mask generation for new, complex schemas, which intermittently halt the entire engine.
Conclusion
Our analysis confirms that guided decoding is essential for achieving reliable and accurate structured outputs from LLMs. While baseline models often fail to adhere to specific formats, guided decoding dramatically improves the rate of structural correctness. However, the performance cost of this improvement varies significantly, and the optimal choice of tools depends entirely on the specific use case.
- Simple, Repetitive Schemas → XGrammar. With predictable workloads, XGrammar’s pre-computation and caching minimize runtime overhead, delivering the highest throughput.
- Dynamic, Complex Schemas → LLGuidance. In environments where each request brings a new or complex schema, LLGuidance’s dynamic strategy avoids costly pre-computation and scales more efficiently.
- Serving Framework Matters. Framework integration is decisive: SGLang’s overlapping of CPU-bound grammar tasks with GPU inference allows it to hide much of the guided decoding overhead, making it a stronger choice than vLLM for many scenarios.
Achieving optimal performance is not about finding a single 'best' solution, but about making an informed choice that matches your workload to the right combination of serving framework and grammar backend. This strategic alignment is the key to unlocking maximum performance and reliability in your AI-powered systems.