AMD XDNA2 NPU Backend — Changelog

Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch.

Ryzen AI MAX 385  ·  RyzenAI-npu5 (XDNA2)  ·  Meta-Llama-3.1-8B-Instruct Q4_K_M
Phase 8 — Performance Ceiling Investigation

Batch decode sweep · int4 kernel · cascade · speculative decoding · 2026-03-26

⚠ Ceiling Confirmed
Conclusion

43.7 t/s is the hard performance ceiling for decode on this hardware+model combination. All four investigated approaches (batch decode, int4 quantisation, kernel cascade, speculative decoding) converge on the same bottleneck: LPDDR5 memory bandwidth serving the KV cache during single-token generation. No software technique can overcome a bandwidth-bound memory wall.

8A — Batch Decode Sweep (N=1..64)

Swept batch decode from N=1 to N=64 to test whether increasing decode batch size could improve throughput. Result: completely flat at 43.7 t/s across all batch sizes.

Batch N        Decode t/s   Change
1 (baseline)   43.7 t/s     —
8              43.7 t/s     flat
32             43.7 t/s     flat
64             43.7 t/s     flat

Diagnosis: Throughput is entirely dominated by streaming the KV cache from LPDDR5 on every decode step. Increasing N multiplies KV reads proportionally, so total bandwidth demand rises with N while peak bandwidth stays fixed: step time grows in lockstep with batch size and aggregate throughput is pinned. The memory bus, not the NPU's compute, is already saturated at N=1, so batching has no headroom to exploit.
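The flat curve is exactly what a KV-bandwidth-bound model predicts. A minimal sketch (the kv_bytes_per_seq and peak_bw figures are placeholders, not measurements):

```python
# Toy model of bandwidth-bound batch decode (illustrative only).
# Assumption: every decode step must stream each sequence's KV cache
# through a fixed-bandwidth memory bus; compute time is negligible.

def decode_tps(n_batch: int, kv_bytes_per_seq: float, peak_bw: float) -> float:
    """Aggregate tokens/s when step time is set purely by KV traffic."""
    step_time = n_batch * kv_bytes_per_seq / peak_bw   # seconds per decode step
    return n_batch / step_time                         # = peak_bw / kv_bytes_per_seq

# Placeholder figures (NOT measurements): 8 GB of KV per sequence, 256 GB/s bus.
for n in (1, 8, 32, 64):
    print(f"N={n:2d}: {decode_tps(n, 8e9, 256e9):.1f} t/s")   # identical for every N
```

The N in the numerator and the N in the step time cancel, which is the sweep's flat line in two lines of algebra.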

8B — Int4 Weight Quantisation & Cascade Kernel
Dead Ends
  • Int4 (double-quantisation): Q4_K_M weights are already 4-bit. Applying a second int4 quantisation stage (int8→int4) dropped the SNR from 44.8 dB to 19.7 dB: unacceptable quality loss with no throughput gain.
  • matrix_vector kernel: Requires i16 activations; our pipeline produces i8. Wrong direction for this architecture — would require rebuilding the entire quantisation pipeline.
  • Cascade design: AMD mlir-aie documentation explicitly states cascaded AIE array designs provide no throughput gain for memory-bandwidth-bound workloads. Ruled out before implementation.

Sanity-check tooling added: tools/int4_quant_sanity.c and tools/int4_quant_sanity.py — measure SNR of the quantisation pipeline at various bit widths without requiring a full model run.
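A minimal sketch of what such a sanity check measures (this is an illustration in the spirit of tools/int4_quant_sanity.py, not the actual tool, and uses per-tensor symmetric quantisation as a simplifying assumption):

```python
import math
import random

# Symmetric linear quantise/dequantise round trip at a given bit width,
# then SNR of the reconstruction against the original vector.

def quantise_dequantise(xs, bits):
    qmax = (1 << (bits - 1)) - 1            # 7 for int4, 127 for int8
    scale = max(abs(x) for x in xs) / qmax  # per-tensor symmetric scale
    return [round(x / scale) * scale for x in xs]

def snr_db(ref, approx):
    signal = sum(x * x for x in ref)
    noise = sum((a - b) ** 2 for a, b in zip(ref, approx))
    return 10 * math.log10(signal / noise)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(4096)]
print(f"int8 round trip: {snr_db(xs, quantise_dequantise(xs, 8)):.1f} dB")
print(f"int4 round trip: {snr_db(xs, quantise_dequantise(xs, 4)):.1f} dB")
```

The familiar ~6 dB-per-bit rule means dropping four bits costs on the order of 24 dB, the same order of degradation as the 44.8 → 19.7 dB drop reported above.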

8C — Speculative Decoding

Speculative decoding amortises the cost of the large target model by having a small, fast draft model propose multiple tokens at once. The target model then verifies the entire batch in a single forward pass (which is much cheaper than N separate forward passes).
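The draft-then-verify control flow can be sketched with toy stand-in "models" (illustrative only; real speculative decoding compares token distributions rather than exact characters, but the accept/correct logic is the same shape):

```python
# target_next plays the large model, draft_next the small one.

def target_next(ctx: str) -> str:
    return "abcd"[len(ctx) % 4]          # pretend target: repeats "abcd"

def draft_next(ctx: str) -> str:
    tok = target_next(ctx)
    return "x" if tok == "d" else tok    # pretend draft: wrong on every 'd'

def speculative_step(ctx: str, n_draft: int):
    # Draft phase: the cheap model proposes n_draft tokens autoregressively.
    proposed, tmp = [], ctx
    for _ in range(n_draft):
        tok = draft_next(tmp)
        proposed.append(tok)
        tmp += tok
    # Verify phase: one batched target pass checks the whole proposal;
    # the first mismatch is replaced by the target's own token.
    accepted = []
    for tok in proposed:
        want = target_next(ctx + "".join(accepted))
        if tok == want:
            accepted.append(tok)
        else:
            accepted.append(want)        # the target's correction still counts
            break
    else:
        accepted.append(target_next(ctx + "".join(accepted)))  # bonus token
    return ctx + "".join(accepted), len(accepted)

ctx, n = speculative_step("", n_draft=8)
print(ctx, n)   # one target pass yields 4 tokens with this toy draft
```

The payoff depends entirely on how often the draft agrees with the target, which is where the experiments below come in.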

Tokenizer compatibility issue: TinyLlama-1.1B uses SentencePiece (32k vocab) — incompatible with Llama-3.1-8B (tiktoken, 128k vocab). Llama-3.2-1B-Instruct was used instead (same llama-bpe 128k tokenizer).

First attempt — n-gram lookup drafting:

Metric          Value
n_drafted       40
n_accept        1 (2.5%)
Effective t/s   41.0 t/s

A 2.5% acceptance rate is effectively zero. n-gram lookup drafting only works on repetitive or templated text (code generation, boilerplate); generative explanatory prose has essentially no repeating n-grams. Ruled out.

First run without explicit context cap: target model consumed 8 GB of VRAM for a 64k-token KV cache, leaving no VRAM for the draft model. Draft ran on CPU at ~20 t/s — slower than the 43.7 t/s NPU decode. Effective throughput fell to 10.9 t/s.

Fix: set -c 8192 to cap context and free VRAM. Both models then offloaded fully (target 33/33 layers, draft 17/17 layers to Vulkan/iGPU).
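The 8 GB figure follows directly from Llama-3.1-8B's public architecture (32 layers, 8 KV heads × 128 head dim) with fp16 K/V, the llama.cpp default when no cache-type override is given. A sketch of the arithmetic:

```python
# KV-cache sizing. Architecture defaults are Llama-3.1-8B's public config;
# elt_bytes=2 assumes fp16 K and V entries.

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, elt_bytes=2):
    per_layer_per_token = 2 * n_kv_heads * head_dim * elt_bytes  # K + V
    return n_ctx * n_layers * per_layer_per_token

print(kv_cache_bytes(64 * 1024) / 2**30)  # 8.0 GiB, the observed blow-up
print(kv_cache_bytes(8192) / 2**30)       # 1.0 GiB once -c 8192 is set
```

At 128 KiB per token, context length is by far the cheapest lever on KV footprint.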

Config                      Accept %   Draft speed   Effective t/s   vs baseline
Draft on CPU (no ctx cap)   59.0%      20 t/s        10.9 t/s        −75%
Draft on GPU (-c 8192)      44.4%      212 t/s       41.6 t/s        −5%
Baseline (no speculative)   —          —             43.7 t/s        —

Why it still doesn't help: The verification step processes n_draft + 1 tokens as one batch through the target model. This is effectively a small prefill: the KV cache is read once for the whole batch rather than once per token. But that read is still bounded by LPDDR5 bandwidth, and it is no smaller than the read a normal decode step performs. Verifying 9 tokens at once does not cost 9× less than verifying 1; the KV-cache I/O per step is roughly the same, just spread across more tokens. The bandwidth wall applies equally to single-token decode and batch verification.

A 44.4% acceptance rate with a 212 t/s draft model would normally yield ~1.8× speedup. The absence of any gain confirms the bottleneck is not in compute but purely in memory bandwidth.
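The ~1.8× figure follows from the standard geometric acceptance model (assuming independent per-token acceptance and neglecting draft cost; p = 0.444 and n_draft = 8 are taken from the run above):

```python
# Expected tokens gained per target verification pass:
#   E = 1 + p + p^2 + ... + p^n = (1 - p^(n+1)) / (1 - p)
# The leading 1 is the guaranteed token from the verify pass itself.

def expected_tokens_per_step(p: float, n_draft: int) -> float:
    return (1.0 - p ** (n_draft + 1)) / (1.0 - p)

print(f"{expected_tokens_per_step(0.444, 8):.2f}")  # 1.80 tokens per step
```

If each verification step cost the same as one baseline decode step, 1.8 tokens per step would mean ~1.8× throughput; the measured 41.6 t/s shows the step is not getting cheaper.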

The 43.7 t/s ceiling is hardware-fundamental. The LPDDR5 memory subsystem on the Ryzen AI MAX 385 delivers a fixed bandwidth budget, and every decoded token requires reading the full KV cache: 32 layers × 4 KiB per layer per token (fp16 K+V, 8 KV heads × 128 head dim) ≈ 128 KiB per token, multiplied by the context length — consistent with the 8 GB cache observed at 64k context. No matter how the compute is organised (single-token NPU dispatch, batched Vulkan verification, or speculative multi-token drafts), the bandwidth demand is structurally the same.

The only path to higher effective throughput at this level would be a hardware or model change: faster memory (LPDDR5X at higher bandwidth), a reduced-precision KV cache (fp8 or int8 KV — not yet supported in this build), or a fundamentally different model architecture with a smaller KV footprint (e.g. multi-head latent attention, MLA, as used in DeepSeek).
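A back-of-envelope roofline makes the ceiling and each remedy concrete. All figures below are assumptions for illustration (an assumed ~256 GB/s of usable LPDDR5 bandwidth and ~4.9 GB of Q4_K_M weights), not measured values:

```python
# Decode roofline: tokens/s <= bandwidth / bytes moved per generated token,
# assuming weights plus the full KV cache are streamed once per token.

PEAK_BW = 256e9        # bytes/s (assumed, not measured)
WEIGHT_BYTES = 4.9e9   # bytes   (approximate Q4_K_M 8B size)

def ceiling_tps(kv_bytes_per_token: int, n_ctx: int) -> float:
    bytes_per_step = WEIGHT_BYTES + kv_bytes_per_token * n_ctx
    return PEAK_BW / bytes_per_step

FP16_KV = 128 * 1024   # bytes per token (fp16 K+V, Llama-3.1-8B shape)
print(f"fp16 KV, 8k ctx: {ceiling_tps(FP16_KV, 8192):.1f} t/s")
print(f"int8 KV, 8k ctx: {ceiling_tps(FP16_KV // 2, 8192):.1f} t/s")
```

Only the numerator (bandwidth) or the denominator (bytes per token) moves the ceiling; nothing in the dispatch strategy appears in the formula at all.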
