Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming the large multilingual vocabulary of a Small Language Model (SLM) is a simple, low-risk way to boost efficiency. It speeds up inference significantly while leaving accuracy almost unchanged.
Aug 04, 2025
Introduction
The shift towards massive, multilingual language models has led to correspondingly massive vocabularies. State-of-the-art models are trained on diverse corpora covering dozens of languages, resulting in vocabulary sizes that often exceed 100K tokens. For example, Llama-3 models have a vocabulary of over 128K tokens, while Qwen-3 and Gemma-3 come with vocabularies of approximately 150K and 256K tokens, respectively.
In large-scale models with tens of billions of parameters, this is not a concern: the embedding layer is a minor fraction of the overall model. For smaller models in the 0.5B-2B parameter range (Small Language Models, SLMs), however, the situation is drastically different. There, the embedding layer is no longer a minor component; it can account for roughly 25-30% of the total model size in models such as Qwen3-0.6B and Gemma-3-1b-it.
Moreover, the embedding layer's share of the total model size grows even larger when the model is quantized for efficiency. Weight-only quantization schemes like GPTQ or AWQ substantially shrink the transformer weights but leave the embedding layer untouched, making its proportional contribution even more substantial.
Figure 1 shows the ratio of the embedding layer size to the total model size across real models of various scales.

Based on this observation, optimizing the embedding layer should meaningfully reduce overall memory usage and improve inference speed. In this article, we introduce a practical embedding-compression strategy: removing 'dead tokens' from the vocabulary.
Dead tokens are tokens that are never used during inference. For example, when a powerful multilingual model like Qwen-3 or Gemma-3 is deployed for a service that operates only in English or Korean, tokens for other languages (Chinese, Japanese, Arabic, and dozens more) remain completely inactive. In such cases, over 40% of the vocabulary may be functionally useless, wasting memory and compute without contributing to model output. Removing those dead tokens brings the following benefits:
- Reduced Memory Footprint: The most straightforward benefit of a smaller embedding layer is lower memory usage. Removing dead tokens reduces VRAM consumption, making it feasible to deploy SLMs in resource-constrained environments.
- Faster LM Head Computation: Most SLMs use tied embeddings for parameter efficiency, so reducing the vocabulary directly shrinks the cost of computing logits for every generated token (a back-of-the-envelope sketch follows this list). This effectively accelerates token generation, especially during the decode phase.
- Improved Tokenizer Efficiency: A smaller tokenizer file and vocabulary also speed up tokenization itself, including both loading and processing times.
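To make the LM-head benefit concrete, here is a back-of-the-envelope sketch of how the per-token logit cost scales with vocabulary size. The hidden size and the before/after vocabulary sizes below are assumed, roughly Qwen3-0.6B-like values, not measured numbers:

```python
# Per generated token, a tied LM head computes logits = h @ W_emb.T,
# i.e. roughly 2 * hidden_size * vocab_size FLOPs (one multiply + one add each).
def lm_head_flops(hidden_size: int, vocab_size: int) -> int:
    return 2 * hidden_size * vocab_size

hidden = 1024                                  # assumed hidden size
full_vocab, trimmed_vocab = 151_936, 90_000    # assumed vocab before/after trimming

print(lm_head_flops(hidden, full_vocab) / 1e6)     # ~311 MFLOPs per token
print(lm_head_flops(hidden, trimmed_vocab) / 1e6)  # ~184 MFLOPs per token
```

On-device decode is typically memory-bandwidth bound, so the reduction in embedding bytes read per token matters at least as much as the FLOP count, but both scale linearly with vocabulary size.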
So, can we safely remove dead tokens and capture these performance gains without degrading model quality? The short answer is yes, through embedding compression via vocabulary trimming.
Embedding Compression via Vocabulary Trimming
To realize the efficiency gains discussed earlier, we leverage vocabulary trimming, a straightforward technique proposed in prior work (Ushio et al., 2023; Vicenti et al., 2024; Goel et al., 2025). We expect it to be particularly useful for recent SLMs, especially in on-device environments where memory usage and latency are critical. An overview of the technique is shown in Figure 2.

How, then, do we identify dead tokens? We explore two different, yet compatible, strategies that can be used together.

Strategy 1: Language-based Trimming
The most intuitive and conservative approach is language-based trimming, a nearly lossless compression strategy. The goal is simple: to remove all tokens from languages that do not align with the application's target requirements.
For a service focused on English and Korean, this means we can eliminate the vast majority of tokens corresponding to other languages. The process involves the following steps:
- Unicode-based Token Identification: Analyze each token's Unicode characters to determine its language, then retain only tokens from target languages and special tokens while filtering out all others.
- Filtered Vocabulary Construction: Build a new, compact vocabulary file and tokenizer configuration containing only the essential tokens identified in the previous step.
- Embedding Extraction: Extract the embedding vectors corresponding to our target language tokens from the original model and transfer these weights to a new, smaller embedding layer.
- Layer Replacement: Replace the model's original embedding layer with the newly constructed, optimized version.
This strategy is a safe and highly effective step, as it only removes tokens that would have gone unused anyway.
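Below is a minimal Python sketch of this pipeline, assuming a Hugging Face-style model with tied embeddings. The checkpoint name, the script list, and the omitted tokenizer-rebuild step are illustrative placeholders rather than the exact procedure we used:

```python
import unicodedata
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"           # assumed checkpoint name
KEEP_SCRIPTS = ("LATIN", "HANGUL")  # English + Korean

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def is_kept(token: str) -> bool:
    """Keep a token if every alphabetic character belongs to a target script."""
    text = tokenizer.convert_tokens_to_string([token])
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not any(name.startswith(script) for script in KEEP_SCRIPTS):
                return False
    return True

special_ids = set(tokenizer.all_special_ids)
keep_ids = sorted(
    token_id for token, token_id in tokenizer.get_vocab().items()
    if token_id in special_ids or is_kept(token)
)

# Slice out the embedding rows we keep and install a smaller embedding layer.
old_emb = model.get_input_embeddings().weight.data
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.shape[1])
new_emb.weight.data.copy_(old_emb[keep_ids])
model.set_input_embeddings(new_emb)
model.tie_weights()                   # with tied embeddings, the LM head follows
model.config.vocab_size = len(keep_ids)

# NOTE: the tokenizer must also be rebuilt so that token i in the new vocabulary
# maps to keep_ids[i] in the old one; that step is tokenizer-specific and omitted.
```

Models without tied embeddings would additionally need their lm_head weights sliced with the same index list.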
Strategy 2: Frequency-based Trimming
Even within a target language, rarely-used tokens still consume memory while contributing little to the model output. Frequency-based Trimming addresses this inefficiency through a lossy compression strategy that can work in combination with language-based trimming.
The implementation follows three steps:
- Target Corpus Preparation: Prepare a large, representative text corpus for each target language to serve as the basis for frequency analysis.
- Token Frequency Analysis: Tokenize the entire corpus using the original tokenizer and calculate the frequency distribution of all tokens across the dataset.
- Rare Token Removal: Identify the least frequent tokens (e.g., bottom 5% or 10%) as "rare" and remove them from the vocabulary. The removal threshold can be adjusted arbitrarily.
However, this approach involves a critical trade-off: aggressive trimming boosts inference speed but increases Out-of-Vocabulary encounters, potentially causing quality degradation. This makes it crucial to find the right balance for your specific needs.
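A sketch of the frequency analysis under the same Hugging Face-style assumptions; the in-memory corpus and the 10% threshold here are toy placeholders for the real corpora and thresholds described later:

```python
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # assumed checkpoint
SPECIAL_IDS = set(tokenizer.all_special_ids)

def count_token_frequencies(texts) -> Counter:
    """Tokenize the corpus and count how often each token id appears."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text, add_special_tokens=False))
    return counts

def rare_token_ids(counts: Counter, drop_fraction: float = 0.10) -> set:
    """Return the ids of the least-frequent tokens (bottom `drop_fraction`)."""
    # Tokens that never appear get frequency 0 and are the first removal candidates.
    freqs = [(counts.get(i, 0), i) for i in range(tokenizer.vocab_size)
             if i not in SPECIAL_IDS]
    freqs.sort()                                  # ascending frequency
    n_drop = int(len(freqs) * drop_fraction)
    return {token_id for _, token_id in freqs[:n_drop]}

# Toy usage; in practice this streams the English/Korean trimming corpora.
corpus = ["Vocabulary trimming is easy.", "작은 모델도 빠르게 동작합니다."]
counts = count_token_frequencies(corpus)
drop_ids = rare_token_ids(counts, drop_fraction=0.10)
print(f"{len(drop_ids)} token ids marked for removal")
```

When combined with language-based trimming, the same ranking is simply applied over the already-trimmed vocabulary, and the resulting ids are removed with the same embedding-slicing step shown earlier.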
To validate these trimming strategies, we conducted a series of experiments to measure their practical impact on model quality and inference speed.
Experiment Setup
Models
We selected three SLMs with large, multilingual vocabularies (Qwen3-0.6B, Gemma-3-1b-it, and Llama-3.2-1B-Instruct) to evaluate the impact across different architectures.
To further test the generalizability of our methods, we also included EXAONE-3.5, a model with a vocabulary already specialized for English and Korean. This allowed us to investigate whether frequency-based trimming remains effective even on a tailored vocabulary.
We used 4-bit weight-only quantized versions of these models for our evaluation. For quantization, a group-wise symmetric round-to-nearest (RTN) method with a group size of 64 was used.
Terminology
We started with a baseline version and applied vocabulary trimming strategies step by step to create several compressed versions:
- Baseline (B): The original, unmodified model.
- Lang-Trim (L): The model after applying language-based trimming (retaining only English and Korean tokens).
- Lang+Freq-Trim (LFn): The model after applying language-based trimming and then frequency-based trimming at various thresholds (n = 5, 10, 20, 40, 60%).
Dataset for Trimming
For our frequency-based trimming strategy, we prepared a balanced corpus of 1 billion tokens by sampling 500 million tokens each from the English C4 dataset and the Korean textbook dataset. This ensured that the frequency distribution was not biased toward either language.
Experimental Results
Figure 4 shows the final vocabulary sizes for each model after applying vocabulary trimming.

Quality Evaluation
We first evaluated whether vocabulary trimming affects the model's core reasoning and linguistic capabilities. Since our optimization target was to preserve performance in English and Korean, we limited our evaluation to benchmarks in those two languages. We used lm-evaluation-harness to assess the quality of each compressed model on several key benchmarks in a zero-shot setting:
- English: ARC-Challenge, MMLU, HellaSwag
- Korean: HAE-RAE, KOBEST, KMMLU
Quality Evaluation Results



For the multilingual models, we observed two clear patterns. As shown in Tables 1-3, the Lang-Trim strategy behaved as expected, removing unused (dead) tokens with negligible impact on accuracy; the small score differences that do appear can be attributed to the occasional presence of dead tokens (e.g., Chinese characters) in the benchmark datasets and to minor artifacts of the log-likelihood calculation.
Building on that, the Lang+Freq-Trim approach exceeded our expectations. Trimming up to 40% of the remaining tokens by frequency, even within the target languages, caused only a minimal drop in benchmark scores. However, past the 60% mark, model quality began to deteriorate noticeably. These results suggest that while moderate trimming is both safe and effective, aggressive vocabulary trimming must be approached with care, balancing memory savings against potential quality loss.
In contrast, EXAONE-3.5-2.4B-Instruct-w4a16 behaved differently (see Table 4). Since its vocabulary was already optimized, Lang-Trim reduced the vocabulary by only 3.7%. Furthermore, while the multilingual models maintained consistent quality up to Lang+Freq-Trim (20%), EXAONE began to show quality degradation on Korean benchmarks after trimming just the bottom 5% of rare tokens. This highlights a key insight: vocabulary trimming is significantly riskier for tailored models than for broad multilingual ones. For models with handcrafted or domain-specific vocabularies, even minor pruning can harm performance, making careful analysis essential before applying these techniques.
Inference Speed Evaluation
Now that we’ve confirmed vocabulary trimming is safe in terms of quality, the next question is whether it actually improves inference speed. Since SLMs are typically used in resource-constrained environments, we chose to run our tests on a mobile phone instead of a server—allowing us to measure the real-world impact of trimming on latency, memory, and runtime performance.
Hardware and Framework Specification
- Device: Apple iPhone 15 Pro (iOS 18.5)
- Framework: MLX-Swift (0.25.6)
Workload Configurations
We tested our models under two distinct workload configurations.
- Prefill-Heavy: Input length of 512 and output length of 64
- Decode-Heavy: Input length of 64 and output length of 512
To simulate realistic usage, we generated random token sequences and ran each test multiple times, reporting the average latency across runs. In line with typical on-device scenarios, all measurements were taken with a batch size of 1.
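While our measurements were taken on-device with MLX-Swift, the workload definition itself is framework-agnostic. The sketch below reproduces the same protocol (random token ids, batch size 1, fixed input/output lengths, averaged over runs) against the Hugging Face transformers API; it is a rough server-side analogue for illustration, not the on-device harness we actually used:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"   # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def measure(input_len: int, output_len: int, runs: int = 5) -> float:
    """Average end-to-end latency (seconds) for one request at batch size 1."""
    latencies = []
    for _ in range(runs):
        # Random token ids stand in for real prompts, as in the on-device tests.
        input_ids = torch.randint(0, model.config.vocab_size, (1, input_len))
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(input_ids,
                           attention_mask=torch.ones_like(input_ids),
                           max_new_tokens=output_len,
                           min_new_tokens=output_len,   # force full output length
                           do_sample=False,
                           pad_token_id=tokenizer.eos_token_id)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

print("prefill-heavy:", measure(input_len=512, output_len=64))
print("decode-heavy :", measure(input_len=64,  output_len=512))
```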
Results

Figure 5 highlights the performance gains observed on Qwen3-0.6B-w4a16, a model whose embedding layer accounts for 57% of total parameters. Even the initial Lang-Trim strategy reduced the overall model size by ~16%, resulting in a ~10% speedup for prefill-heavy workloads and ~15% for decode-heavy ones. More aggressive trimming yielded even better results. With Lang+Freq-Trim (20%), we achieved a ~16% speedup in prefill-heavy tasks and ~24% in decode-heavy scenarios, all with minimal impact on quality. At the most aggressive setting, Lang+Freq-Trim (60%), decode-heavy performance improved by ~50%, demonstrating how heavily embedding size can bottleneck runtime in on-device environments.
Figure 6 shows the most dramatic gains with Gemma-3-1b-it-w4a16, thanks to its massive 256K vocabulary and broad multilingual token coverage. The impact of Lang-Trim was especially strong, cutting end-to-end latency by ~15% in prefill-heavy workloads and ~21% in decode-heavy scenarios. Frequency-based trimming pushed these gains even further: at Lang+Freq-Trim (60%), latency dropped by ~38% in prefill-heavy and ~59% in decode-heavy workloads, clearly demonstrating the outsized cost of large embedding tables in real-world inference.
Unlike the previous models, Llama-3.2-1B-Instruct-w4a16 showed only modest improvements at first. With a relatively smaller 128K vocabulary and limited multilingual coverage, Lang-Trim led to just an 8.7% reduction in size, translating to only a ~5% latency gain in prefill-heavy and ~8% in decode-heavy workloads (see Figure 7). Still, more aggressive trimming paid off. At Lang+Freq-Trim (60%), latency improved by ~29% (prefill-heavy) and ~44% (decode-heavy), a meaningful gain despite the smaller vocabulary. That said, there is a clear trade-off here: the "safe" Lang+Freq-Trim (20%) setting gave only ~11-17% speedups, and pushing beyond that will likely require accepting some quality degradation, especially for the Llama-3 family.
Vocabulary trimming had the least impact on EXAONE-3.5-2.4B-Instruct-w4a16. With a total of 2.4 billion parameters, its embedding layer represents a much smaller slice of the overall model compared to the smaller SLMs. In addition, as shown in Figure 4, Lang-Trim shrank the vocabulary by only 3.7%, since the model is already optimized for English and Korean. As a result, Lang-Trim reduced the total model size by just 1.2%, leading to marginal speedups: ~1% in prefill-heavy and ~2% in decode-heavy workloads (see Figure 8). While more aggressive Freq-Trim did push decode-heavy speedups up to ~29%, that gain comes at a cost: model quality began to drop early, meaning that even small trims can have outsized effects on language-specific models.
Overall, the results point to a clear trend: as vocabulary size decreases, both prefill and decode speeds improve, consistently across all tested models. While the magnitude of the gains varies with model architecture and vocabulary composition, one conclusion holds: vocabulary trimming is a practical and effective way to accelerate SLMs, especially in latency-sensitive environments like on-device inference.
Conclusion
Our experiments show that vocabulary trimming is a high-impact strategy for accelerating SLM inference—delivering up to 1.585× decode speedups and significantly improving on-device latency.
Here are two key takeaways:
- Start with Language-Based Trimming. For services targeting specific languages, removing irrelevant tokens is a safe, lossless first step. For example, Gemma-3-1b-it saw over 20% latency reduction with this alone.
- Use Frequency-Based Trimming for Maximum Gains. When latency is critical, trimming low-frequency tokens yields even bigger speedups, up to 1.585× in Gemma-3-1b-it and 1.491× in Qwen3-0.6B. But this comes with quality degradation, so aggressive trimming must be carefully validated.
Vocabulary optimization is not a minor tweak; it is a practical, high-leverage tool for making small models faster and more deployable in real-world scenarios.