# Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch
## Phase 7: TILE_N=64 decode xclbins · MIN_N=1 MAX_N=1 routing · 2026-03-26
| Test | Phase 6 | Phase 7 | Backend | Change |
|---|---|---|---|---|
| Prefill pp512 | 30.91 t/s | 930.92 t/s | Vulkan | +30× |
| Decode tg64 | 3.76 t/s | 43.84 t/s | NPU | +11.7× |

| Backend | Decode | Avg Power | J/tok |
|---|---|---|---|
| Phase 6 (CPU decode) | 3.76 t/s | 58.5 W | 16.2 J/tok |
| Vulkan | 41.6 t/s | 52.2 W | 1.3 J/tok |
| Phase 7 NPU decode | 43.84 t/s | 41.5 W | 0.947 J/tok |
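The J/tok column is just average power divided by decode throughput (W = J/s, so W ÷ tok/s = J/tok); e.g. the Phase 7 row gives 41.5 W / 43.84 t/s ≈ 0.947 J/tok:

```cpp
#include <cassert>

// Energy per decoded token = average SoC power (W = J/s) / decode rate (tok/s).
double joules_per_token(double avg_power_w, double decode_tok_per_s) {
    return avg_power_w / decode_tok_per_s;
}
```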
Built 4 decode xclbins with TILE_N=64 (minimum n=16 per AIE hardware constraint — n=8 rejected by kernel). Swapped all 4 .zshrc slots from ~/xclbin-4col/ to ~/xclbin-decode/.
Added MIN_N=1, MAX_N=1 to restrict NPU to single-token decode. Vulkan picks up all prefill ops (N>1) automatically via llama.cpp backend scheduling.
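As a sketch (identifiers hypothetical, not the backend's actual symbols), the decode-only gate reduces to a range check on n in the backend's op-support hook:

```cpp
#include <cassert>

// Hypothetical sketch of the decode-only routing gate. With MIN_N = MAX_N = 1
// the NPU backend only claims single-token (decode) MUL_MATs; anything with
// n > 1 (prefill) is left for Vulkan to pick up via llama.cpp's scheduler.
constexpr int MIN_N = 1;
constexpr int MAX_N = 1;

bool npu_claims_mul_mat(int n) {
    return n >= MIN_N && n <= MAX_N;
}
```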
Confirmed the NPU handles 100% of decode: with GGML_VK_DISABLE=1, decode stays at 42.90 t/s (within run-to-run variance of the 43.84 t/s above), showing Vulkan contributes nothing to decode throughput.
NPU decode (43.84 t/s) is on par with Vulkan decode (41.6 t/s); the advantage is that the dedicated XDNA2 silicon leaves the iGPU free for other workloads.
XDNA2 hw_context limit clarified: driver source (npu5_regs.c) shows hwctx_limit=16, not 4. Previous 8-slot failure was column exhaustion (8×4-col = 32 cols > 16 available), not a firmware context cap.
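The column-exhaustion arithmetic is plain: with 16 columns on the array, 8 concurrent 4-column xclbins cannot fit, while 4 of them exactly fill it:

```cpp
#include <cassert>

// XDNA2 exposes 16 AIE columns; per npu5_regs.c the hw-context limit is also
// 16, so the column budget (not contexts) is what capped the 8-slot attempt.
constexpr int TOTAL_AIE_COLS = 16;

bool slots_fit(int n_slots, int cols_per_slot) {
    return n_slots * cols_per_slot <= TOTAL_AIE_COLS;
}
```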
## Phase 6: 4× AIE column parallelism · TILE_N=256 prefill xclbins · all 4 K-slots
| Test | Phase 5 | Phase 6 | Change |
|---|---|---|---|
| Prefill pp512 | 10.2 t/s | 13.7 t/s | +34% |
| Prefill pp2048 | 12.9 t/s | 19.5 t/s | +51% |
| Prefill pp4096 | 11.7 t/s | 16.2 t/s | +38% |
| Prefill pp8192 | 8.9 t/s | 10.9 t/s | +22% |
Rebuilt all 4 K-slot xclbins (K=2048, 4096, 5632, 14336) with n_aie_cols=4, upgrading from 1-column to 4-column dispatch. Peak throughput at pp=2048 where tile utilisation is highest.
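One way the 4-column dispatch might divide work (a sketch under the assumption that output rows are split evenly across columns; the real tiling lives in the mlir-aie design):

```cpp
#include <utility>
#include <vector>

// Split m output rows into n_cols contiguous chunks, one per AIE column.
// Returns (start_row, row_count) pairs; earlier columns absorb any remainder.
std::vector<std::pair<int, int>> split_rows(int m, int n_cols) {
    std::vector<std::pair<int, int>> chunks;
    int base = m / n_cols, rem = m % n_cols, row = 0;
    for (int c = 0; c < n_cols; ++c) {
        int len = base + (c < rem ? 1 : 0);
        chunks.emplace_back(row, len);
        row += len;
    }
    return chunks;
}
```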
Power draw increased to 58.5 W average (vs 45.8 W in Phase 5), with all 4 AIE columns active during prefill. Decode remains CPU-only, so the NPU sits idle but still draws power during generation.
## Phase 5: 8k context validated · attention fallback characterised · bench-context.sh tooling
Validated 8k context on 30 GiB RAM (KV cache ≈ 1 GiB, model ≈ 4.6 GiB). NPU throughput peaks at pp=2048 (full tile utilisation) and degrades at longer context because attention score matmuls (K=seq_len, variable K) fall back to CPU; only fixed-K projection matmuls are offloaded.
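The ≈1 GiB KV-cache figure checks out against Llama 3.1 8B's GQA geometry (32 layers, 8 KV heads, head_dim 128, f16 elements):

```cpp
#include <cassert>
#include <cstdint>

// KV cache size = 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (f16)
// x context length. Geometry below is Llama 3.1 8B's GQA configuration.
std::int64_t kv_cache_bytes(std::int64_t n_ctx) {
    const std::int64_t n_layers = 32, n_kv_heads = 8, head_dim = 128, f16 = 2;
    return 2 * n_layers * n_kv_heads * head_dim * f16 * n_ctx;
}
```

At n_ctx = 8192 this lands on exactly 2^30 bytes, i.e. 1 GiB.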
Added bench-context.sh tooling for sweep benchmarks across context lengths.
## Phase 4: NPU vs Vulkan power characterisation · bench-power.sh
| Backend | Prefill | Decode | Avg Power | J/tok (decode) |
|---|---|---|---|---|
| NPU 4-col (Phase 6) | 32.7 t/s | 3.6 t/s | 58.5 W | 16.2 J/tok |
| Vulkan iGPU | 632 t/s | 41.6 t/s | 52.2 W | 1.3 J/tok |
Confirmed the NPU runs on dedicated XDNA2 silicon with no GPU contention, so it is the preferred backend when the iGPU is busy with other workloads. Added bench-power.sh for SoC PPT measurement.
## Phase 3: K=4096 and K=14336 slots · tile-loop optimisation
Added K=4096 (attention + FFN gate/up) and K=14336 (FFN down) xclbin slots. Tile-loop optimised to reduce host overhead. All 4 dominant K dimensions in Llama 3.1 8B now offloaded to NPU.
## Phase 2: second xclbin slot for K=5632 FFN down layers
Extended the backend to support a second simultaneously-loaded xclbin covering K=5632 (FFN down layers in TinyLlama). The slot system was generalised to support up to 8 slots with auto-selection by K dimension.
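A sketch of slot auto-selection by K (identifiers hypothetical; the real slots would hold XRT kernel handles alongside the K dimension):

```cpp
#include <cassert>

// Up to 8 xclbin slots, each serving one fixed K dimension; a MUL_MAT routes
// to the first slot whose K matches, or falls back to CPU (-1) otherwise.
constexpr int MAX_SLOTS = 8;
int slot_k[MAX_SLOTS] = { 2048, 5632, 0, 0, 0, 0, 0, 0 };  // TinyLlama-era setup

int slot_for_k(int k) {
    for (int i = 0; i < MAX_SLOTS; ++i)
        if (slot_k[i] == k) return i;
    return -1;  // no matching xclbin loaded: CPU fallback
}
```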
## Phase 1: XDNA2 ggml backend · weight cache · K=2048 (TinyLlama 1.1B)
First working XDNA2 backend for llama.cpp. Implements GGML_OP_MUL_MAT dispatch via XRT to a compiled mlir-aie xclbin. Weights quantised to int8 with per-row scales and cached after first use. All non-MUL_MAT ops fall back to CPU.
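A minimal sketch of per-row symmetric int8 quantisation with one scale per row (illustrative, not the backend's exact code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One scale per weight row: scale = max(|w|) / 127, q = round(w / scale).
// Dequantisation on the accumulator side is q * scale.
struct QuantRow {
    std::vector<std::int8_t> q;
    float scale;
};

QuantRow quantize_row_int8(const std::vector<float>& w) {
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    QuantRow r;
    r.scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    r.q.reserve(w.size());
    for (float x : w)
        r.q.push_back(static_cast<std::int8_t>(std::lround(x / r.scale)));
    return r;
}
```

Caching the quantised rows after first use is what makes repeated dispatches cheap: the float-to-int8 conversion happens once per weight tensor, not once per token.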
Validated on TinyLlama 1.1B with K=2048 attention layers generating correct output.
## Summary

| Metric | Phase 1 (baseline) | Best NPU-only | Phase 7 (current) |
|---|---|---|---|
| Prefill pp512 | ~4.6 t/s (CPU) | 13.7 t/s (NPU Phase 6) | 930 t/s (Vulkan) |
| Decode tg64 | ~4.4 t/s (CPU) | ~4.1 t/s (CPU, NPU idle) | 42 t/s (NPU) |