Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming the large multilingual vocabulary of a Small Language Model (SLM) is a simple, low-risk way to boost efficiency. It speeds up inference significantly while leaving accuracy almost unchanged.
Aug 04, 2025
Introduction
The shift towards massive, multilingual language models has led to correspondingly massive vocabularies. State-of-the-art models are trained on diverse corpora covering dozens of languages, resulting in vocabulary sizes that often exceed 100K tokens. For example, Llama-3 models have a vocabulary of over 128K tokens, while Qwen-3 and Gemma-3 come with vocabularies of approximately 150K and 256K tokens, respectively.
In large-scale models with tens of billions of parameters, this is not a concern: the embedding layer is a minor fraction of the overall model. For smaller models in the 0.5B-2B parameter range (Small Language Models, SLMs), however, the situation is drastically different. There, the embedding layer is no longer a minor component; it can account for roughly 25-30% of the total model size in models such as Qwen3-0.6B and Gemma-3-1b-it.
Moreover, the embedding layer's share of the total model size grows even larger when the model is quantized for efficiency. Weight-only quantization schemes like GPTQ or AWQ substantially shrink the transformer weights but leave the embedding layer untouched, making its proportional contribution even more substantial.
Figure 1 shows the ratio of the embedding layer size to the total model size across real models of various scales.

Based on this observation, optimizing the embedding layer should meaningfully reduce overall memory usage and improve inference speed. In this article, we introduce a practical embedding-compression strategy: removing 'dead tokens' from the vocabulary.
Dead tokens are tokens that are never used during inference. For example, when a powerful multilingual model like Qwen-3 or Gemma-3 is deployed for a service that operates only in English or Korean, tokens for other languages (Chinese, Japanese, Arabic, and dozens more) remain completely inactive. In such cases, over 40% of the vocabulary may be functionally useless, wasting memory and compute without contributing to model output. Removing those dead tokens brings the following benefits:
- Reduced Memory Footprint: The most straightforward benefit of a smaller embedding layer is lower memory usage. Removing dead tokens reduces VRAM consumption, making it feasible to deploy SLMs in resource-constrained environments.
- Faster LM Head Computation: Most SLMs use tied embeddings for parameter efficiency, so reducing the vocabulary directly shrinks the cost of computing logits for every generated token (a back-of-the-envelope sketch follows this list). This effectively accelerates token generation, especially during the decode phase.
- Improved Tokenizer Efficiency: A smaller tokenizer file and vocabulary also speed up tokenization itself, including both loading and processing times.
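To make the LM-head benefit concrete, here is a back-of-the-envelope sketch of how the per-token logit cost scales with vocabulary size. The hidden size and the before/after vocabulary sizes below are assumed, roughly Qwen3-0.6B-like values, not measured numbers:

```python
# Per generated token, a tied LM head computes logits = h @ W_emb.T,
# i.e. roughly 2 * hidden_size * vocab_size FLOPs (one multiply + one add each).
def lm_head_flops(hidden_size: int, vocab_size: int) -> int:
    return 2 * hidden_size * vocab_size

hidden = 1024                                  # assumed hidden size
full_vocab, trimmed_vocab = 151_936, 90_000    # assumed vocab before/after trimming

print(lm_head_flops(hidden, full_vocab) / 1e6)     # ~311 MFLOPs per token
print(lm_head_flops(hidden, trimmed_vocab) / 1e6)  # ~184 MFLOPs per token
```

On-device decode is typically memory-bandwidth bound, so the reduction in embedding bytes read per token matters at least as much as the FLOP count, but both scale linearly with vocabulary size.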
So, can we safely remove dead tokens and capture these performance gains without degrading model quality? The short answer is yes, through embedding compression via vocabulary trimming.
Embedding Compression via Vocabulary Trimming
To realize the efficiency gains discussed earlier, we leverage vocabulary trimming, a straightforward technique proposed in prior work (Ushio et al., 2023; Vicenti et al., 2024; Goel et al., 2025). We expect it to be particularly useful for recent SLMs, especially in on-device environments where memory usage and latency are critical. An overview of the technique is shown in Figure 2.

How, then, do we identify dead tokens? We explore two different, yet compatible, strategies that can be used together.

Strategy 1: Language-based Trimming
The most intuitive and conservative approach is language-based trimming, a nearly lossless compression strategy. The goal is simple: to remove all tokens from languages that do not align with the application's target requirements.
For a service focused on English and Korean, this means we can eliminate the vast majority of tokens corresponding to other languages. The process involves the following steps:
- Unicode-based Token Identification: Analyze each token's Unicode characters to determine its language, then retain only tokens from target languages and special tokens while filtering out all others.
- Filtered Vocabulary Construction: Build a new, compact vocabulary file and tokenizer configuration containing only the essential tokens identified in the previous step.
- Embedding Extraction: Extract the embedding vectors corresponding to our target language tokens from the original model and transfer these weights to a new, smaller embedding layer.
- Layer Replacement: Replace the model's original embedding layer with the newly constructed, optimized version.
This strategy is a safe and highly effective step, as it only removes tokens that would have gone unused anyway.
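Below is a minimal Python sketch of this pipeline, assuming a Hugging Face-style model with tied embeddings. The checkpoint name, the script list, and the omitted tokenizer-rebuild step are illustrative placeholders rather than the exact procedure we used:

```python
import unicodedata
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"           # assumed checkpoint name
KEEP_SCRIPTS = ("LATIN", "HANGUL")  # English + Korean

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def is_kept(token: str) -> bool:
    """Keep a token if every alphabetic character belongs to a target script."""
    text = tokenizer.convert_tokens_to_string([token])
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not any(name.startswith(script) for script in KEEP_SCRIPTS):
                return False
    return True

special_ids = set(tokenizer.all_special_ids)
keep_ids = sorted(
    token_id for token, token_id in tokenizer.get_vocab().items()
    if token_id in special_ids or is_kept(token)
)

# Slice out the embedding rows we keep and install a smaller embedding layer.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.shape[1])
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
model.tie_weights()                   # with tied embeddings, the LM head follows
model.config.vocab_size = len(keep_ids)

# NOTE: the tokenizer must also be rebuilt so that token i in the new vocabulary
# maps to keep_ids[i] in the old one; that step is tokenizer-specific and omitted.
```

Models without tied embeddings would additionally need their lm_head weights sliced with the same index list.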
Strategy 2: Frequency-based Trimming
Even within a target language, rarely-used tokens still consume memory while contributing little to the model output. Frequency-based Trimming addresses this inefficiency through a lossy compression strategy that can work in combination with language-based trimming.
The implementation follows three steps:
- Target Corpus Preparation: Prepare a large, representative text corpus for each target language to serve as the basis for frequency analysis.
- Token Frequency Analysis: Tokenize the entire corpus using the original tokenizer and calculate the frequency distribution of all tokens across the dataset.
- Rare Token Removal: Identify the least frequent tokens (e.g., bottom 5% or 10%) as "rare" and remove them from the vocabulary. The removal threshold can be adjusted arbitrarily.
However, this approach involves a critical trade-off: aggressive trimming boosts inference speed but increases Out-of-Vocabulary encounters, potentially causing quality degradation. This makes it crucial to find the right balance for your specific needs.
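A sketch of the frequency analysis under the same Hugging Face-style assumptions; the in-memory corpus and the 10% threshold here are toy placeholders for the real corpora and thresholds described later:

```python
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # assumed checkpoint
SPECIAL_IDS = set(tokenizer.all_special_ids)

def count_token_frequencies(texts) -> Counter:
    """Tokenize the corpus and count how often each token id appears."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text, add_special_tokens=False))
    return counts

def rare_token_ids(counts: Counter, drop_fraction: float = 0.10) -> set:
    """Return the ids of the least-frequent tokens (bottom `drop_fraction`)."""
    # Tokens that never appear get frequency 0 and are the first removal candidates.
    freqs = [(counts.get(i, 0), i) for i in range(tokenizer.vocab_size)
             if i not in SPECIAL_IDS]
    freqs.sort()                                  # ascending frequency
    n_drop = int(len(freqs) * drop_fraction)
    return {token_id for _, token_id in freqs[:n_drop]}

# Toy usage; in practice this streams the English/Korean trimming corpora.
corpus = ["Vocabulary trimming is easy.", "작은 모델도 빠르게 동작합니다."]
counts = count_token_frequencies(corpus)
drop_ids = rare_token_ids(counts, drop_fraction=0.10)
print(f"{len(drop_ids)} token ids marked for removal")
```

When combined with language-based trimming, the same ranking is simply applied over the already-trimmed vocabulary, and the resulting ids are removed with the same embedding-slicing step shown earlier.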
To validate these trimming strategies, we conducted a series of experiments to measure their practical impact on model quality and inference speed.
Experiment Setup
Models
We selected three SLMs with large, multilingual vocabularies (Qwen3-0.6B, Gemma-3-1b-it, and Llama-3.2-1B-Instruct) to evaluate the impact across different architectures.
To further test the generalizability of our methods, we also included EXAONE-3.5, a model with a vocabulary already specialized for English and Korean. This allowed us to investigate whether frequency-based trimming remains effective even on a tailored vocabulary.
We used 4-bit weight-only quantized versions of these models for our evaluation. For quantization, a group-wise symmetric round-to-nearest (RTN) method with a group size of 64 was used.
Terminology
We started with a baseline version and applied vocabulary trimming strategies step by step to create several compressed versions:
- Baseline (B): The original, unmodified model.
- Lang-Trim (L): The model after applying language-based trimming (retaining only English and Korean tokens).
- Lang+Freq-Trim (LFn): The model after applying language-based trimming and then frequency-based trimming at various thresholds (n = 5, 10, 20, 40, 60%).
Dataset for Trimming
For our frequency-based trimming strategy, we prepared a balanced corpus of 1 billion tokens by sampling 500 million tokens each from the English C4 dataset and the Korean textbook dataset. This ensured that the frequency distribution was not biased toward either language.
Experimental Results
Figure 4 shows the final vocabulary sizes for each model after applying vocabulary trimming.

Quality Evaluation
We first evaluated whether vocabulary trimming affects the model's core reasoning and linguistic capabilities. Since our optimization target was to preserve performance in English and Korean, we limited our evaluation to benchmarks in those two languages. We used lm-evaluation-harness to assess the quality of each compressed model on several key benchmarks in a zero-shot setting:
- English: ARC-Challenge, MMLU, HellaSwag
- Korean: HAE-RAE, KOBEST, KMMLU
Quality Evaluation Results



For the multilingual models, we observed two clear patterns. As shown in Tables 1-3, the Lang-Trim strategy behaved as expected, removing unused (dead) tokens with negligible impact on accuracy; the small score differences that do appear can be attributed to the occasional presence of dead tokens (e.g., Chinese characters) in the benchmark datasets and to minor artifacts of the log-likelihood calculation.
Building on that, the Lang+Freq-Trim approach exceeded our expectations. Trimming up to 40% of the remaining tokens by frequency, even within the target languages, caused only a minimal drop in benchmark scores. However, past the 60% mark, model quality began to deteriorate noticeably. These results suggest that while moderate trimming is both safe and effective, aggressive vocabulary trimming must be approached with care, balancing memory savings against potential quality loss.
In contrast, EXAONE-3.5-2.4B-Instruct-w4a16 behaved differently (see Table 4). Since its vocabulary was already optimized, Lang-Trim reduced the vocabulary by only 3.7%. Furthermore, while the multilingual models maintained consistent quality up to Lang+Freq-Trim (20%), EXAONE began to show quality degradation on Korean benchmarks after trimming just the bottom 5% of rare tokens. This highlights a key insight: vocabulary trimming is significantly riskier for tailored models than for broad multilingual ones. For models with handcrafted or domain-specific vocabularies, even minor pruning can harm performance, making careful analysis essential before applying these techniques.
Inference Speed Evaluation
Now that we’ve confirmed vocabulary trimming is safe in terms of quality, the next question is whether it actually improves inference speed. Since SLMs are typically used in resource-constrained environments, we chose to run our tests on a mobile phone instead of a server—allowing us to measure the real-world impact of trimming on latency, memory, and runtime performance.
Hardware and Framework Specification
- Device: Apple iPhone 15 Pro (iOS 18.5)
- Framework: MLX-Swift (0.25.6)
Workload Configurations
We tested our models under two distinct workload configurations.
- Prefill-Heavy: Input length of 512 and output length of 64
- Decode-Heavy: Input length of 64 and output length of 512
To simulate realistic usage, we generated random token sequences and ran each test multiple times, reporting the average latency across runs. In line with typical on-device scenarios, all measurements were taken with a batch size of 1.
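While our measurements were taken on-device with MLX-Swift, the workload definition itself is framework-agnostic. The sketch below reproduces the same protocol (random token ids, batch size 1, fixed input/output lengths, averaged over runs) against the Hugging Face transformers API; it is a rough server-side analogue for illustration, not the on-device harness we actually used:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"   # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def measure(input_len: int, output_len: int, runs: int = 5) -> float:
    """Average end-to-end latency (seconds) for one request at batch size 1."""
    latencies = []
    for _ in range(runs):
        # Random token ids stand in for real prompts, as in the on-device tests.
        input_ids = torch.randint(0, model.config.vocab_size, (1, input_len))
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(input_ids,
                           attention_mask=torch.ones_like(input_ids),
                           max_new_tokens=output_len,
                           min_new_tokens=output_len,   # force full output length
                           do_sample=False,
                           pad_token_id=tokenizer.eos_token_id)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

print("prefill-heavy:", measure(input_len=512, output_len=64))
print("decode-heavy :", measure(input_len=64,  output_len=512))
```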
Results

Figure 5 highlights the performance gains observed on Qwen3-0.6B-w4a16, a model whose embedding layer accounts for 57% of total parameters. Even the initial Lang-Trim strategy reduced the overall model size by ~16%, resulting in a ~10% speedup for prefill-heavy workloads and ~15% for decode-heavy ones. More aggressive trimming yielded even better results. With Lang+Freq-Trim (20%), we achieved a ~16% speedup in prefill-heavy tasks and ~24% in decode-heavy scenarios, all with minimal impact on quality. At the most aggressive setting, Lang+Freq-Trim (60%), decode-heavy performance improved by ~50%, demonstrating how heavily embedding size can bottleneck runtime in on-device environments.
Figure 6 shows the most dramatic gains with Gemma-3-1b-it-w4a16, thanks to its massive 256K vocabulary and broad multilingual token coverage. The impact of Lang-Trim was especially strong, cutting end-to-end latency by ~15% in prefill-heavy workloads and ~21% in decode-heavy scenarios. Frequency-based trimming pushed these gains even further: at Lang+Freq-Trim (60%), latency dropped by ~38% in prefill-heavy and ~59% in decode-heavy workloads, clearly demonstrating the outsized cost of large embedding tables in real-world inference.
Unlike the previous models, Llama-3.2-1B-Instruct-w4a16 showed only modest improvements at first. With a relatively smaller 128K vocabulary and limited multilingual coverage, Lang-Trim led to just an 8.7% reduction in size, translating to only a ~5% latency gain in prefill-heavy and ~8% in decode-heavy workloads (see Figure 7). Still, more aggressive trimming paid off. At Lang+Freq-Trim (60%), latency improved by ~29% (prefill-heavy) and ~44% (decode-heavy), a meaningful gain despite the smaller vocabulary. That said, there is a clear trade-off here: the "safe" Lang+Freq-Trim (20%) setting gave only ~11-17% speedups, and pushing beyond that will likely require accepting some quality degradation, especially for the Llama-3 family.
Vocabulary trimming had the least impact on EXAONE-3.5-2.4B-Instruct-w4a16. With a total of 2.4 billion parameters, its embedding layer represents a much smaller slice of the overall model compared to the smaller SLMs. In addition, as shown in Figure 4, Lang-Trim shrank the vocabulary by only 3.7%, since the model is already optimized for English and Korean. As a result, Lang-Trim reduced the total model size by just 1.2%, leading to marginal speedups: ~1% in prefill-heavy and ~2% in decode-heavy workloads (see Figure 8). While more aggressive Freq-Trim did push decode-heavy speedups up to ~29%, that gain comes at a cost: model quality began to drop early, meaning that even small trims can have outsized effects on language-specific models.
Overall, the results point to a clear trend: as vocabulary size decreases, both prefill and decode speeds improve, consistently across all tested models. While the magnitude of the gains varies with model architecture and vocabulary composition, one conclusion holds: vocabulary trimming is a practical and effective way to accelerate SLMs, especially in latency-sensitive environments like on-device inference.
Conclusion
Our experiments show that vocabulary trimming is a high-impact strategy for accelerating SLM inference—delivering up to 1.585× decode speedups and significantly improving on-device latency.
Here are two key takeaways:
- Start with Language-Based Trimming. For services targeting specific languages, removing irrelevant tokens is a safe, lossless first step. For example, Gemma-3-1b-it saw over 20% latency reduction with this alone.
- Use Frequency-Based Trimming for Maximum Gains. When latency is critical, trimming low-frequency tokens yields even bigger speedups, up to 1.585× in Gemma-3-1b-it and 1.491× in Qwen3-0.6B. But this comes with quality degradation, so aggressive trimming must be carefully validated.
Vocabulary optimization is not a minor tweak; it is a practical, high-leverage tool for making small models faster and more deployable in real-world scenarios.