AMD XDNA2 NPU Backend — Changelog

Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch.

Ryzen AI MAX 385  ·  RyzenAI-npu5 (XDNA2)  ·  Meta-Llama-3.1-8B-Instruct Q4_K_M
Phase 8 — Performance Ceiling Investigation

Batch decode sweep · int4 kernel · cascade · speculative decoding · 2026-03-26

⚠ Ceiling Confirmed
Conclusion

43.7 t/s is the hard performance ceiling for decode on this hardware+model combination. All four investigated approaches (batch decode, int4 quantisation, kernel cascade, speculative decoding) converge on the same bottleneck: LPDDR5 memory bandwidth serving the KV cache during single-token generation. No software technique can overcome a bandwidth-bound memory wall.

8A — Batch Decode Sweep (N=1..64)

Swept batch decode from N=1 to N=64 to test whether increasing decode batch size could improve throughput. Result: completely flat at 43.7 t/s across all batch sizes.

Batch N        Decode t/s   Change
1 (baseline)   43.7 t/s     —
8              43.7 t/s     flat
32             43.7 t/s     flat
64             43.7 t/s     flat

Diagnosis: Throughput is entirely dominated by streaming the KV cache from LPDDR5 on every decode step. Increasing N multiplies KV reads proportionally, so total bandwidth demand rises with N while peak bandwidth stays fixed: step time grows in lockstep with batch size and aggregate throughput is pinned. The memory bus, not the NPU's compute, is already saturated at N=1, so batching has no headroom to exploit.
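The flat curve is exactly what a KV-bandwidth-bound model predicts. A minimal sketch (the kv_bytes_per_seq and peak_bw figures are placeholders, not measurements):

```python
# Toy model of bandwidth-bound batch decode (illustrative only).
# Assumption: every decode step must stream each sequence's KV cache
# through a fixed-bandwidth memory bus; compute time is negligible.

def decode_tps(n_batch: int, kv_bytes_per_seq: float, peak_bw: float) -> float:
    """Aggregate tokens/s when step time is set purely by KV traffic."""
    step_time = n_batch * kv_bytes_per_seq / peak_bw   # seconds per decode step
    return n_batch / step_time                         # = peak_bw / kv_bytes_per_seq

# Placeholder figures (NOT measurements): 8 GB of KV per sequence, 256 GB/s bus.
for n in (1, 8, 32, 64):
    print(f"N={n:2d}: {decode_tps(n, 8e9, 256e9):.1f} t/s")   # identical for every N
```

The N in the numerator and the N in the step time cancel, which is the sweep's flat line in two lines of algebra.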

8B — Int4 Weight Quantisation & Cascade Kernel
Dead Ends
  • Int4 (double-quantisation): Q4_K_M weights are already 4-bit. Applying a second int4 quantisation stage (int8→int4) dropped the SNR from 44.8 dB to 19.7 dB: unacceptable quality loss with no throughput gain.
  • matrix_vector kernel: Requires i16 activations; our pipeline produces i8. Wrong direction for this architecture — would require rebuilding the entire quantisation pipeline.
  • Cascade design: AMD mlir-aie documentation explicitly states cascaded AIE array designs provide no throughput gain for memory-bandwidth-bound workloads. Ruled out before implementation.

Sanity-check tooling added: tools/int4_quant_sanity.c and tools/int4_quant_sanity.py — measure SNR of the quantisation pipeline at various bit widths without requiring a full model run.
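A minimal sketch of what such a sanity check measures (this is an illustration in the spirit of tools/int4_quant_sanity.py, not the actual tool, and uses per-tensor symmetric quantisation as a simplifying assumption):

```python
import math
import random

# Symmetric linear quantise/dequantise round trip at a given bit width,
# then SNR of the reconstruction against the original vector.

def quantise_dequantise(xs, bits):
    qmax = (1 << (bits - 1)) - 1            # 7 for int4, 127 for int8
    scale = max(abs(x) for x in xs) / qmax  # per-tensor symmetric scale
    return [round(x / scale) * scale for x in xs]

def snr_db(ref, approx):
    signal = sum(x * x for x in ref)
    noise = sum((a - b) ** 2 for a, b in zip(ref, approx))
    return 10 * math.log10(signal / noise)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(4096)]
print(f"int8 round trip: {snr_db(xs, quantise_dequantise(xs, 8)):.1f} dB")
print(f"int4 round trip: {snr_db(xs, quantise_dequantise(xs, 4)):.1f} dB")
```

The familiar ~6 dB-per-bit rule means dropping four bits costs on the order of 24 dB, the same order of degradation as the 44.8 → 19.7 dB drop reported above.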

8C — Speculative Decoding

Speculative decoding amortises the cost of the large target model by having a small, fast draft model propose multiple tokens at once. The target model then verifies the entire batch in a single forward pass (which is much cheaper than N separate forward passes).
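The draft-then-verify control flow can be sketched with toy stand-in "models" (illustrative only; real speculative decoding compares token distributions rather than exact characters, but the accept/correct logic is the same shape):

```python
# target_next plays the large model, draft_next the small one.

def target_next(ctx: str) -> str:
    return "abcd"[len(ctx) % 4]          # pretend target: repeats "abcd"

def draft_next(ctx: str) -> str:
    tok = target_next(ctx)
    return "x" if tok == "d" else tok    # pretend draft: wrong on every 'd'

def speculative_step(ctx: str, n_draft: int):
    # Draft phase: the cheap model proposes n_draft tokens autoregressively.
    proposed, tmp = [], ctx
    for _ in range(n_draft):
        tok = draft_next(tmp)
        proposed.append(tok)
        tmp += tok
    # Verify phase: one batched target pass checks the whole proposal;
    # the first mismatch is replaced by the target's own token.
    accepted = []
    for tok in proposed:
        want = target_next(ctx + "".join(accepted))
        if tok == want:
            accepted.append(tok)
        else:
            accepted.append(want)        # the target's correction still counts
            break
    else:
        accepted.append(target_next(ctx + "".join(accepted)))  # bonus token
    return ctx + "".join(accepted), len(accepted)

ctx, n = speculative_step("", n_draft=8)
print(ctx, n)   # one target pass yields 4 tokens with this toy draft
```

The payoff depends entirely on how often the draft agrees with the target, which is where the experiments below come in.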

Tokenizer compatibility issue: TinyLlama-1.1B uses SentencePiece (32k vocab) — incompatible with Llama-3.1-8B (tiktoken, 128k vocab). Llama-3.2-1B-Instruct was used instead (same llama-bpe 128k tokenizer).

First attempt — n-gram lookup drafting:

Metric          Value
n_drafted       40
n_accept        1 (2.5%)
Effective t/s   41.0 t/s

A 2.5% acceptance rate is effectively zero. n-gram lookup drafting only works on repetitive or templated text (code generation, boilerplate); generative explanatory prose has essentially no repeating n-grams. Ruled out.

First run without explicit context cap: target model consumed 8 GB of VRAM for a 64k-token KV cache, leaving no VRAM for the draft model. Draft ran on CPU at ~20 t/s — slower than the 43.7 t/s NPU decode. Effective throughput fell to 10.9 t/s.

Fix: set -c 8192 to cap context and free VRAM. Both models then offloaded fully (target 33/33 layers, draft 17/17 layers to Vulkan/iGPU).
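The 8 GB figure follows directly from Llama-3.1-8B's public architecture (32 layers, 8 KV heads × 128 head dim) with fp16 K/V, the llama.cpp default when no cache-type override is given. A sketch of the arithmetic:

```python
# KV-cache sizing. Architecture defaults are Llama-3.1-8B's public config;
# elt_bytes=2 assumes fp16 K and V entries.

def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, elt_bytes=2):
    per_layer_per_token = 2 * n_kv_heads * head_dim * elt_bytes  # K + V
    return n_ctx * n_layers * per_layer_per_token

print(kv_cache_bytes(64 * 1024) / 2**30)  # 8.0 GiB, the observed blow-up
print(kv_cache_bytes(8192) / 2**30)       # 1.0 GiB once -c 8192 is set
```

At 128 KiB per token, context length is by far the cheapest lever on KV footprint.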

Config                      Accept %   Draft speed   Effective t/s   vs baseline
Draft on CPU (no ctx cap)   59.0%      20 t/s        10.9 t/s        −75%
Draft on GPU (-c 8192)      44.4%      212 t/s       41.6 t/s        −5%
Baseline (no speculative)   —          —             43.7 t/s        —

Why it still doesn't help: The verification step processes n_draft + 1 tokens as one batch through the target model. This is effectively a small prefill: the KV cache is read once for the whole batch rather than once per token. But that read is still bounded by LPDDR5 bandwidth, and it is no smaller than the read a normal decode step performs. Verifying 9 tokens at once does not cost 9× less than verifying 1; the KV-cache I/O per step is roughly the same, just spread across more tokens. The bandwidth wall applies equally to single-token decode and batch verification.

A 44.4% acceptance rate with a 212 t/s draft model would normally yield ~1.8× speedup. The absence of any gain confirms the bottleneck is not in compute but purely in memory bandwidth.
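The ~1.8× figure follows from the standard geometric acceptance model (assuming independent per-token acceptance and neglecting draft cost; p = 0.444 and n_draft = 8 are taken from the run above):

```python
# Expected tokens gained per target verification pass:
#   E = 1 + p + p^2 + ... + p^n = (1 - p^(n+1)) / (1 - p)
# The leading 1 is the guaranteed token from the verify pass itself.

def expected_tokens_per_step(p: float, n_draft: int) -> float:
    return (1.0 - p ** (n_draft + 1)) / (1.0 - p)

print(f"{expected_tokens_per_step(0.444, 8):.2f}")  # 1.80 tokens per step
```

If each verification step cost the same as one baseline decode step, 1.8 tokens per step would mean ~1.8× throughput; the measured 41.6 t/s shows the step is not getting cheaper.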

The 43.7 t/s ceiling is hardware-fundamental. The LPDDR5 memory subsystem on the Ryzen AI MAX 385 delivers a fixed bandwidth budget, and every decoded token requires reading the full KV cache: 32 layers × 4 KiB per layer per token (fp16 K+V, 8 KV heads × 128 head dim) ≈ 128 KiB per token, multiplied by the context length — consistent with the 8 GB cache observed at 64k context. No matter how the compute is organised (single-token NPU dispatch, batched Vulkan verification, or speculative multi-token drafts), the bandwidth demand is structurally the same.

The only path to higher effective throughput at this level would be a hardware or model change: faster memory (LPDDR5X at higher bandwidth), a reduced-precision KV cache (fp8 or int8 KV — not yet supported in this build), or a fundamentally different model architecture with a smaller KV footprint (e.g. multi-head latent attention, MLA, as used in DeepSeek).
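A back-of-envelope roofline makes the ceiling and each remedy concrete. All figures below are assumptions for illustration (an assumed ~256 GB/s of usable LPDDR5 bandwidth and ~4.9 GB of Q4_K_M weights), not measured values:

```python
# Decode roofline: tokens/s <= bandwidth / bytes moved per generated token,
# assuming weights plus the full KV cache are streamed once per token.

PEAK_BW = 256e9        # bytes/s (assumed, not measured)
WEIGHT_BYTES = 4.9e9   # bytes   (approximate Q4_K_M 8B size)

def ceiling_tps(kv_bytes_per_token: int, n_ctx: int) -> float:
    bytes_per_step = WEIGHT_BYTES + kv_bytes_per_token * n_ctx
    return PEAK_BW / bytes_per_step

FP16_KV = 128 * 1024   # bytes per token (fp16 K+V, Llama-3.1-8B shape)
print(f"fp16 KV, 8k ctx: {ceiling_tps(FP16_KV, 8192):.1f} t/s")
print(f"int8 KV, 8k ctx: {ceiling_tps(FP16_KV // 2, 8192):.1f} t/s")
```

Only the numerator (bandwidth) or the denominator (bytes per token) moves the ceiling; nothing in the dispatch strategy appears in the formula at all.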
