Disaggregated Inference on Apple Silicon: NPU prefill and GPU decode
In this article, we show how to run LLMs efficiently on Apple Silicon using a disaggregated inference technique.
Aug 26, 2025
Contents
1. Introduction
2. Hidden Challenges Facing the ANE
  - Too Many States Will Kill You!
  - Make State Dimension Sizes Powers of Two
  - Update States with Care
  - Double-check The ANE Delegation
  - Don’t Let The pow In
  - Break Down SDPA for Long Sequences
3. MLX vs. Core ML: When to Use Each
  - Benchmarks
  - Key Takeaways
4. Yetter Inference Engine - Disaggregated Inference for Apple Silicon
  - End-to-end Latency Comparisons
  - Key Takeaways
5. Future Works

1. Introduction
As demand increases for running large language models (LLMs) on personal devices to improve privacy and offline use, Apple Silicon has emerged as a robust hardware platform with its unified CPU, GPU, and Neural Engine (ANE) architecture. While standard frameworks like PyTorch are compatible, fully leveraging Apple's hardware requires specialized tools like MLX and Core ML. MLX is a flexible, developer-friendly framework inspired by PyTorch and NumPy, featuring a unified memory model that closely aligns with Apple Silicon's design and enables operations on the CPU or GPU without data transfer. However, MLX's main limitation is its lack of support for the ANE.
To harness the power-efficient, high-throughput performance of the ANE, developers need to use Apple's Core ML framework. Core ML is less flexible than MLX and requires models to have fixed input shapes for ANE compatibility. The multifunction model feature is crucial for deploying language models within this constraint; it enables you to combine separate fixed-shape functions, such as one for prefill and another for decode, into a single package that shares weights, thereby keeping the model file size small. Additionally, the introduction of stateful models in iOS 18 and macOS 15 provides an efficient way to manage the key-value (KV) cache by allowing the model to maintain persistent, updatable buffers during runtime. Together, these specialized features facilitate deploying language models on the ANE, overcoming the framework's inherent rigidity.
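As a brief illustration, separately converted fixed-shape prefill and decode models can be merged into one weight-sharing package with coremltools' multifunction utilities (coremltools 8 or later is assumed; the package paths and function names below are placeholders, not our actual artifacts):

```python
import coremltools as ct

# Merge two fixed-shape models (one converted for prefill, one for decode) into
# a single .mlpackage; identical weight blobs are deduplicated and shared, which
# keeps the merged package close to the size of a single model.
desc = ct.utils.MultiFunctionDescriptor()
# Arguments: (source package, source function name, new function name)
desc.add_function("prefill.mlpackage", "main", "prefill")
desc.add_function("decode.mlpackage", "main", "decode")
desc.default_function_name = "decode"

ct.utils.save_multifunction(desc, "model_multifunction.mlpackage")

# At load time, pick the function to run:
prefill_model = ct.models.MLModel("model_multifunction.mlpackage",
                                  function_name="prefill")
```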
In this post, we will first explore the many undocumented challenges of running LLMs on the ANE and share practical solutions. We then provide a detailed performance comparison between MLX (GPU) and Core ML (ANE) to establish a baseline and highlight their unique benefits. Building on these insights, we introduce our Yetter Inference Engine, a new system that employs a disaggregated inference approach to leverage the strengths of both the ANE and GPU for better performance. Finally, we share comprehensive benchmarks for our engine and discuss future directions for this work.
2. Hidden Challenges Facing the ANE
Despite the Core ML features mentioned above, many undocumented challenges must still be addressed before the ANE can be fully utilized. Below are some of these challenges, along with workarounds identified through painstaking debugging.
Too Many States Will Kill You!
Even if Core ML's state feature makes it easier to handle the KV cache efficiently, using it on the ANE is more complicated than you might think. For example, if your model has too many states, it can fail to compile for the ANE. To fix this, you need to implement custom Hugging Face cache interfaces (`Cache` and `CacheLayerMixin`) that internally manage the key/value caches for all layers as a single concatenated tensor with shape `(num_hidden_layers, num_key_value_heads, max_cache_len, head_dim)`, assuming `batch_size=1`. For instance, in the case of Qwen3-0.6B, the custom cache interface results in only two states in the converted Core ML model, `key_cache` and `value_cache` (see Figure 1 and Figure 2), compared to 56 states when each layer-wise cache is treated as a separate state.

Apple's official blog post, "On Device Llama 3.1 with Core ML," provides an example implementation of the cache interface. However, it relies on outdated Transformers APIs and is therefore incompatible with the most recent versions. Most critically, it is designed for GPU acceleration rather than the ANE.

Figure 1 and Figure 2. The converted Core ML model with `max_seq_len=512`. Only two states are included by using the custom cache interface.

However, this custom implementation conflicts with the recent change in Hugging Face's per-layer cache interface. As a result, this approach needs further adjustments for advanced caching mechanisms with heterogeneous cache layers, such as the sliding window attention used in google/gemma-3-1b-it.
Make State Dimension Sizes Powers of Two
Furthermore, if any of the states has a dimension (other than the very first one) whose size is not a power of two, the model may encounter a runtime error. For example, when LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct is packaged as a stateful model, its state shape is (20, 8, 512, 80); note that 80 is not a power of two. When you run this model through the Core ML Swift API, specifically MLModel's prediction method, you'll see the following error:
Error Domain=com.apple.CoreML Code=0 "Unable to compute the prediction using ML Program. It can be an invalid input data or broken/unsupported model."
The runtime error can be fixed by padding the states to the shape (20, 8, 512, 128). To prevent similar problems in the future, the cache interface should internally pad the non-first dimensions of the cache tensors to the next power of two and then use only the unpadded parts of these tensors.
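For example, the padding for the EXAONE-3.5-2.4B case above (head_dim of 80) can be handled along these lines; the helper name is hypothetical:

```python
import torch

def next_power_of_two(n: int) -> int:
    # Smallest power of two >= n, e.g. 80 -> 128.
    return 1 << (n - 1).bit_length()

# EXAONE-3.5-2.4B-Instruct cache shape: (20, 8, 512, 80); 80 is not a power of two.
num_layers, num_kv_heads, max_cache_len, head_dim = 20, 8, 512, 80
padded_head_dim = next_power_of_two(head_dim)  # 128

key_cache = torch.zeros(num_layers, num_kv_heads, max_cache_len, padded_head_dim)

# Inside the cache interface, only the unpadded slice is ever read or written:
key_states = torch.randn(1, num_kv_heads, 7, head_dim)  # dummy new keys
key_cache[0, :, :7, :head_dim] = key_states[0]
usable_keys = key_cache[..., :head_dim]
```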
Update States with Care
Even models that already meet the power-of-two constraint mentioned above may still face compilation failures with the following unhelpful error message.
Error Domain=com.apple.CoreML Code=0 "Failed to build the model execution plan using a model architecture file"
It turns out that adding a small constant value right before updating the value caches, as shown in Figure 3, effectively addresses this issue. Theoretically, zero would be the ideal choice to preserve numerical correctness, but an addition of zero is automatically eliminated by the PyTorch JIT tracer's graph optimizer. A sensible choice is therefore `torch.finfo(torch.float32).smallest_normal`, which is truncated to zero when cast to float16 precision by Core ML's graph optimization. Notably, this trick is only required for value cache updating in the prefill graph.

Figure 3. [Left] The value cache updating subgraph visualized by Netron. The `add` node (highlighted by the red box) is inserted to prevent the compilation failure. [Right] The key cache updating subgraph. In most models, a rotary positional embedding (RoPE) is applied to the key states before updating the key caches; the `add` node created by the RoPE (highlighted by the orange box) is likely what prevents the compilation failure.
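In code, the workaround amounts to something like the following sketch (the function name and the concatenated cache layout are illustrative):

```python
import torch

# The tiny constant survives PyTorch JIT tracing (unlike an addition of zero,
# which the tracer's graph optimizer removes) and is truncated to zero once
# Core ML casts the graph to float16, so numerics are unaffected.
TINY = torch.finfo(torch.float32).smallest_normal

def write_value_cache(value_cache: torch.Tensor, value_states: torch.Tensor,
                      layer_idx: int, start: int) -> None:
    # value_cache:  (num_layers, num_kv_heads, max_cache_len, head_dim)
    # value_states: (1, num_kv_heads, seq_len, head_dim)
    seq_len = value_states.shape[-2]
    # The explicit add node below is what prevents the
    # "Failed to build the model execution plan" error in the prefill graph.
    value_cache[layer_idx, :, start:start + seq_len, :] = value_states[0] + TINY
```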
Double-check The ANE Delegation
After applying a few more adjustments (beyond what can be covered in this post), the model's prefill and decode phases can finally be compiled for the ANE. Nonetheless, it remains essential to verify that each operation within the model is actually delegated to the ANE. This can be done by generating a performance report in Xcode, which shows per-operation device delegation status and performance metrics. As illustrated in Figure 4 and Figure 5, nearly all operations in both the prefill and decode phases are executed on the ANE, with only a few exceptions: the token embedding layer (represented as a gather operation), the LM head (represented as a linear operation), and a few minor element-wise operations.


Notice that the Xcode performance reports presented above are comparable to those presented in Apple's official blog post "Deploying Transformers on the Apple Neural Engine", even though the models discussed there are predominantly traditional sequence classification models, such as distilbert/distilbert-base-uncased-finetuned-sst-2-english.
Don’t Let The pow In
Hold on! It’s not over yet. Even if the model is up and running on the ANE, it might generate broken text as a result of numerical instability, as shown in Figure 6.

Figure 6. [Left] Llama-3.2-1B-Instruct produces only newline characters before applying the graph optimizations. [Right] The model generates valid sentences after the graph optimizations.
Surprisingly, this numerical instability is primarily caused by `pow` operations derived from the `torch.Tensor.pow` method used in RMSNorm or LayerNorm layers, as shown in Figure 7. Instead of manually amending Hugging Face's model implementations, it is more practical to add a custom Model Intermediate Language (MIL) graph pass to the Core ML converter that rewrites `pow(x, 2)` as `mul(x, x)`. Furthermore, an additional graph pass can fold the `add` operation within the RMSNorm subgraph into the `epsilon` parameter of the subsequent `rsqrt` operation.

Figure 7. An RMSNorm subgraph in the Llama-3.2-1B-Instruct model. [Left] The original computation graph involves a numerically unstable `pow` node and a redundant `add` node. [Right] The resulting computation graph after applying the custom MIL graph pass that replaces the `pow` node with an equivalent `mul` node and fuses the `add` node into its subsequent `rsqrt` node's parameter.
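As a rough sketch, such a pass can follow the pattern of coremltools' built-in passes. The import paths and block-mutation helpers below vary slightly across coremltools versions, so treat the details as assumptions rather than the exact pass we use:

```python
import numpy as np
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.mil.passes.graph_pass import AbstractGraphPass
from coremltools.converters.mil.mil.passes.helper import block_context_manager
from coremltools.converters.mil.mil.passes.pass_registry import register_pass

@register_pass(namespace="common")
class replace_pow2_with_mul(AbstractGraphPass):
    """Rewrite pow(x, 2) as mul(x, x) to sidestep the numerically unstable
    pow implementation on the ANE."""

    def apply(self, prog):
        for f in prog.functions.values():
            self._replace_in_block(f)

    @block_context_manager
    def _replace_in_block(self, block):
        for op in list(block.operations):
            for child_block in op.blocks:
                self._replace_in_block(child_block)
            if op.op_type != "pow" or op.y.val is None or not np.all(op.y.val == 2):
                continue
            # Build mul(x, x) right before the pow op, then reroute its uses.
            new_var = mb.mul(x=op.x, y=op.x, name=op.name + "_as_mul", before_op=op)
            block.replace_uses_of_var_after_op(
                anchor_op=op, old_var=op.outputs[0], new_var=new_var
            )
            block.remove_ops([op])
```

Registering a pass does not run it automatically; it still has to be inserted into the conversion pass pipeline, as shown later for the SDPA pass.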
Break Down SDPA for Long Sequences
Finally, the model is running on the ANE and generating valid outputs. However, if the model is compiled with an increased maximum sequence length (e.g., 1024 or 2048), you may encounter the familiar compilation error: “Failed to build the model execution plan using a model architecture file”. Given the context, it is reasonable to deduce that the cause of this failure is the significant memory requirement of executing the `scaled_dot_product_attention` operation with long sequence lengths.

Unexpectedly, the unused MIL graph pass `scaled_dot_product_attention_sliced_q` is a hidden gem in Core ML Tools capable of addressing this issue. Its docstring describes its functionality as follows:

```python
@register_pass(namespace="common")
class scaled_dot_product_attention_sliced_q(AbstractGraphPass):
    """
    Replace the ios18.scaled_dot_product_attention operation with a memory efficient
    implementation of attention calculation based on slicing Q. The benefits are clearly
    visible for higher Q sequence lengths, though.

    Graph pass options:
    - min_seq_length: int
      Only operations working with Q of sequence length greater or equal to this value will be transformed.
    - seq_length_divider: int
      Defines the size of the chunks of Q being processed in SDPA (chunk_size = seq_length / seq_length_divider)
    """
```
Adding this pass (with a few modifications) to the Core ML conversion pipeline should eliminate the compilation error for long sequence lengths.
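For reference, a pass can be enabled and configured through coremltools' pass pipeline API roughly as follows. This is a sketch of the vanilla pass without the modifications mentioned above, and the option values are illustrative:

```python
import coremltools as ct

# Enable the (normally unused) sliced-Q SDPA pass during conversion.
# The pass name comes from its @register_pass(namespace="common") decorator.
pipeline = ct.PassPipeline.DEFAULT
pipeline.append_pass("common::scaled_dot_product_attention_sliced_q")
pipeline.set_options(
    "common::scaled_dot_product_attention_sliced_q",
    {"min_seq_length": "1024", "seq_length_divider": "16"},
)

# The pipeline is then handed to the converter, e.g.:
#   mlmodel = ct.convert(traced_prefill_model, ..., pass_pipeline=pipeline)
```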
3. MLX vs. Core ML: When to Use Each
Is it worth the enormous effort to run language models on the ANE with Core ML? While MLX is certainly the more convenient choice for deployment on Apple Silicon, the following benchmark results demonstrate that each has its own strengths and weaknesses.
Benchmarks
We measured the Time To First Token (TTFT) and Time Per Output Token (TPOT) for several models utilizing both MLX (GPU) and Core ML (ANE) under two different scenarios.
- Prefill-heavy scenario (448+64): the model generates 64 tokens given 448 input tokens
- Decode-heavy scenario (64+448): the model generates 448 tokens given 64 input tokens
For each scenario, the time taken by the tokenizer and the model is measured separately to understand their contributions to the TTFT since the tokenizer’s encoding overhead is quite significant compared to the model’s prefill latency in most cases. Meanwhile, only the overall TPOT is measured because the tokenizer’s decoding overhead is negligible.
Regarding the Core ML models, input sequences are padded to the maximum sequence length of 512 and the model weights are quantized to INT4 precision with per-channel granularity, while activations are maintained in FP16. Similarly, for the MLX models, the weights are quantized to INT4 precision with per-group granularity, using a group size of 64, whereas activations are preserved in either FP16 or BF16, depending on the models’ configurations.
In all cases, the chat templates are applied by the tokenizers to better simulate realistic deployment settings. The performances are measured on iPhone 15 Pro with iOS 18.6.1.
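For reference, the Core ML weight quantization described above corresponds roughly to the following coremltools call (coremltools 8 or later is assumed; the package path is a placeholder, and any model-specific tweaks are omitted):

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# INT4 weights with per-channel granularity; activations stay in FP16.
mlmodel = ct.models.MLModel("Qwen3-0.6B.mlpackage")  # placeholder path

config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype="int4",
        granularity="per_channel",
    )
)
quantized_model = linear_quantize_weights(mlmodel, config=config)
quantized_model.save("Qwen3-0.6B-int4.mlpackage")
```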
Qwen3-0.6B

Llama-3.2-1B-Instruct

Key Takeaways
In prefill-heavy scenarios, Core ML using the ANE can substantially reduce the TTFT with minimal drawbacks in decode-heavy scenarios. However, MLX running on the GPU consistently achieves better TPOT across both prefill-heavy and decode-heavy scenarios. It is also worth noting that the tokenizer encoding latency varies among models, which is another factor in overall performance.
4. Yetter Inference Engine - Disaggregated Inference for Apple Silicon
Looking at the performance of MLX and Core ML reveals a clear opportunity for improvement. MLX and the GPU excel during decoding, while Core ML and the ANE are better for prefill. This suggests an obvious idea: why not use both? This is the core concept of our Yetter Inference Engine for Apple Silicon. We disaggregate the inference process, assigning each stage to the hardware best suited for it.
However, building a foundation that integrates Core ML and MLX comes at a cost. Although some popular open-source projects for running LLMs on Apple Silicon exist, most focus either on CPU/GPU usage (e.g., llama.cpp) or solely on the ANE (e.g., Anemll). Therefore, besides supporting pure ANE inference with Core ML and GPU inference with MLX, we have added support for disaggregated inference that uses both the ANE and the GPU in our Yetter Inference Engine. This approach maximizes LLM performance on Apple Silicon by assigning the prefill stage to Core ML, which runs it on the ANE to exploit its high throughput, and the decode stage to MLX, which handles the iterative generation of new tokens on the GPU to take advantage of its fast decoding speed.
To support various language models with minimal effort, the Yetter Inference Engine features a streamlined conversion tool that can transform any Hugging Face language model into a single, multifunction Core ML package. The conversion tool can package the model as either stateful (for pure ANE inference) or stateless (for disaggregated prefill, where the KV cache tensors are treated as outputs).
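To make the handoff concrete, here is a conceptual sketch of the disaggregated flow in Python. The input/output names, the `decode_step` callable, and the overall structure are illustrative assumptions, not the engine's actual API (which runs through Swift on device):

```python
import numpy as np
import coremltools as ct
import mlx.core as mx

def prefill_on_ane(prefill_model: ct.models.MLModel, input_ids: np.ndarray):
    # Stateless prefill: the KV cache tensors come back as ordinary outputs
    # (the output names here are assumptions for this sketch).
    outputs = prefill_model.predict({"input_ids": input_ids.astype(np.int32)})
    return outputs["logits"], outputs["key_cache"], outputs["value_cache"]

def decode_on_gpu(decode_step, first_token: int,
                  key_cache: np.ndarray, value_cache: np.ndarray,
                  max_new_tokens: int):
    # Hand the ANE-produced caches to MLX on the GPU; unified memory keeps the
    # transfer cheap. `decode_step` stands in for one GPU decode iteration.
    keys, values = mx.array(key_cache), mx.array(value_cache)
    tokens = [first_token]
    for _ in range(max_new_tokens - 1):
        logits, keys, values = decode_step(tokens[-1], keys, values)
        tokens.append(int(mx.argmax(logits[-1]).item()))
    return tokens
```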
End-to-end Latency Comparisons
To thoroughly evaluate our approach against the existing ones, we measured end-to-end latencies for various models using MLX, Core ML, and our Yetter Inference Engine (i.e., disaggregated inference) under both prefill-heavy and decode-heavy scenarios. The specific settings remain the same as in the previous experiments.
Qwen3-0.6B

Llama-3.2-1B-Instruct

EXAONE-4.0-1.2B

HyperCLOVAX-SEED-Text-Instruct-1.5B

kanana-1.5-2.1b-instruct-2505

Key Takeaways
As shown above, the Yetter Inference Engine supports various models on Apple Silicon with consistent performance profiles. Compared to MLX, it significantly reduces prefill latency; its prefill latency is only slightly higher than Core ML's because Yetter uses stateless prefill models, whereas Core ML can utilize stateful ones. Its decode latencies are nearly identical to MLX's. This allows Yetter to consistently outperform both MLX and Core ML in prefill-heavy scenarios, while in decode-heavy scenarios it performs similarly to MLX and notably surpasses Core ML.
5. Future Works
The success of this disaggregated inference model opens up exciting possibilities for future work. We are exploring ways to extend this approach by using different levels of numerical precision for the prefill and decode stages, which could provide additional performance improvements. Another direction involves developing a system that dynamically routes workloads between the GPU and ANE based on the specific task and device condition. We also plan to expand support for more model architectures and intend to open-source the engine to encourage community collaboration and feedback.