AMD XDNA2 NPU Backend — Changelog

Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch.

Ryzen AI MAX 385  ·  RyzenAI-npu5 (XDNA2)  ·  Meta-Llama-3.1-8B-Instruct Q4_K_M
Phase 7: NPU Decode Acceleration + Vulkan Prefill

TILE_N=64 decode xclbins · MIN_N=1 MAX_N=1 routing · 2026-03-26

✅ Done
Wins
- Decode: +11.7× (3.76 → 43.84 t/s)
- Prefill: +30× (31 → 931 t/s)
- NPU decode confirmed on hardware
- Beats Vulkan on energy: 0.947 vs 1.3 J/tok
- Lower power than Vulkan: 41.5 vs 52.2 W
| Test | Phase 6 | Phase 7 | Backend | Change |
|---|---|---|---|---|
| Prefill pp512 | 30.91 t/s | 930.92 t/s | Vulkan | +30× |
| Decode tg64 | 3.76 t/s | 43.84 t/s | NPU | +11.7× |
| Backend | Decode | Avg Power | J/tok |
|---|---|---|---|
| Phase 6 (CPU decode) | 3.76 t/s | 58.5 W | 16.2 J/tok |
| Vulkan | 41.6 t/s | 52.2 W | 1.3 J/tok |
| Phase 7 NPU decode | 43.84 t/s | 41.5 W | 0.947 J/tok |
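The J/tok column is simply average package power divided by decode throughput; a quick check of the arithmetic for the two accelerated rows:

```python
def joules_per_token(avg_power_w: float, decode_tps: float) -> float:
    # Energy per generated token = average package power (J/s) / throughput (tok/s).
    return avg_power_w / decode_tps

vulkan = joules_per_token(52.2, 41.6)   # ~1.25, reported as 1.3 J/tok
npu = joules_per_token(41.5, 43.84)     # ~0.947 J/tok
print(round(vulkan, 2), round(npu, 3))
```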

Built 4 decode xclbins with TILE_N=64 (minimum n=16 per AIE hardware constraint; n=8 was rejected by the kernel). Swapped all 4 slot paths in .zshrc from ~/xclbin-4col/ to ~/xclbin-decode/.

Added MIN_N=1, MAX_N=1 to restrict NPU to single-token decode. Vulkan picks up all prefill ops (N>1) automatically via llama.cpp backend scheduling.
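A minimal sketch of this routing gate, assuming the backend filters offload candidates on the batch dimension n (the names here are hypothetical, not the actual backend symbols):

```python
MIN_N, MAX_N = 1, 1  # NPU claims single-token (decode) matmuls only

SLOT_KS = {2048, 4096, 5632, 14336}  # loaded xclbin K-slots

def npu_takes_mul_mat(n: int, k: int) -> bool:
    """True if the NPU should claim this mul_mat; otherwise llama.cpp's
    backend scheduler lets Vulkan (or CPU) take it."""
    return MIN_N <= n <= MAX_N and k in SLOT_KS

print(npu_takes_mul_mat(1, 4096))    # decode step -> NPU
print(npu_takes_mul_mat(512, 4096))  # prefill batch -> Vulkan
```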

Confirmed NPU handles 100% of decode: GGML_VK_DISABLE=1 leaves decode at 42.90 t/s (unchanged), proving Vulkan contributes nothing to decode throughput.

NPU decode speed (~42 t/s) matches Vulkan decode (~43 t/s); the advantage is that the dedicated XDNA2 silicon leaves the iGPU free for other workloads.

XDNA2 hw_context limit clarified: driver source (npu5_regs.c) shows hwctx_limit=16, not 4. Previous 8-slot failure was column exhaustion (8×4-col = 32 cols > 16 available), not a firmware context cap.
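The column-exhaustion explanation checks out arithmetically. A sketch of the implied resource model, with both limits taken from the text (contexts are bounded by the driver's hwctx limit, but each loaded xclbin also consumes physical AIE columns):

```python
HWCTX_LIMIT = 16     # driver hw_context limit (npu5_regs.c), not 4 as assumed before
TOTAL_COLUMNS = 16   # physical AIE columns available on XDNA2

def contexts_fit(n_contexts: int, cols_per_context: int) -> bool:
    return (n_contexts <= HWCTX_LIMIT
            and n_contexts * cols_per_context <= TOTAL_COLUMNS)

print(contexts_fit(8, 4))  # 8 x 4 = 32 columns > 16: the old 8-slot failure
print(contexts_fit(4, 4))  # 4 x 4 = 16 columns: fits
```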

Phase 6: Multi-Core NPU (4-Column, TILE_N=256)

4× AIE column parallelism · TILE_N=256 prefill xclbins · all 4 K-slots

✅ Done
Wins
- pp2048: 19.5 t/s (+51% vs Phase 5)
- pp512: 13.7 t/s (+34% vs Phase 5)
- All 4 K-slots upgraded to 4-col
| Test | Phase 5 | Phase 6 | Change |
|---|---|---|---|
| Prefill pp512 | 10.2 t/s | 13.7 t/s | +34% |
| Prefill pp2048 | 12.9 t/s | 19.5 t/s | +51% |
| Prefill pp4096 | 11.7 t/s | 16.2 t/s | +38% |
| Prefill pp8192 | 8.9 t/s | 10.9 t/s | +22% |

Rebuilt all 4 K-slot xclbins (K=2048, 4096, 5632, 14336) with n_aie_cols=4, upgrading from 1-column to 4-column dispatch. Peak throughput at pp=2048 where tile utilisation is highest.

Power draw increased to 58.5 W average (vs 45.8 W in Phase 5) with all 4 AIE columns active during prefill. Decode is still CPU-only, so the NPU sits idle during generation while still drawing power.

Phase 5: Long Context & Context-Length Characterisation

8k context validated · attention fallback characterised · bench-context.sh tooling

✅ Done
Wins
- 8k context validated (no OOM)
- NPU 2–3× over CPU at all lengths
- Peak throughput at pp=2048

Validated 8k context on 30 GiB RAM (KV cache ≈ 1 GiB, model ≈ 4.6 GiB). NPU throughput peaks at pp=2048 (full tile utilisation) and degrades at longer context because attention score matmuls (K=seq_len, variable-K) fall back to CPU — only fixed-K projection matmuls offload.
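A sketch of that offload split, assuming slot lookup keys on a fixed K: projection matmuls have K set by the model architecture, so a pre-built xclbin slot can exist for each, while attention-score matmuls have K = seq_len, which no fixed slot can match (K values below are the slots named elsewhere in this changelog):

```python
# Fixed-K projection matmuls map to pre-built xclbin slots; variable-K
# attention-score matmuls (K = seq_len) never match and fall back to CPU.
SLOT_KS = {2048, 4096, 5632, 14336}

def offloads(k: int) -> bool:
    return k in SLOT_KS

print(offloads(14336))  # FFN down projection -> NPU
print(offloads(1537))   # attention scores at seq_len 1537 -> CPU fallback
```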

Added bench-context.sh tooling for sweep benchmarks across context lengths.

Phase 4: Power Measurement & Workload Isolation

NPU vs Vulkan power characterisation · bench-power.sh

✅ Done
Wins
- NPU confirmed non-competing (dedicated XDNA2 silicon)
- Power baseline established
| Backend | Prefill | Decode | Avg Power | J/tok (decode) |
|---|---|---|---|---|
| NPU 4-col (Phase 6) | 32.7 t/s | 3.6 t/s | 58.5 W | 16.2 J/tok |
| Vulkan iGPU | 632 t/s | 41.6 t/s | 52.2 W | 1.3 J/tok |

Confirmed NPU runs on dedicated XDNA2 silicon — no GPU contention. NPU is preferred when iGPU is busy with other workloads. Added bench-power.sh for SoC PPT measurement.

Phase 3: 8B Model Support (4 K-Slots)

K=4096 and K=14336 slots · tile-loop optimisation

✅ Done
Wins
- Llama 3.1 8B fully covered
- +15–28% prefill vs CPU

Added K=4096 (attention + FFN gate/up) and K=14336 (FFN down) xclbin slots. Tile-loop optimised to reduce host overhead. All 4 dominant K dimensions in Llama 3.1 8B now offloaded to NPU.

Phase 2: Dual-Slot Dispatch

Second xclbin slot for K=5632 FFN down layers

✅ Done

Extended the backend to support a second simultaneously loaded xclbin covering K=5632 (the FFN down layers in TinyLlama). The slot system was generalised to support up to 8 slots with auto-selection by K dimension.

Phase 1: NPU Baseline

XDNA2 ggml backend · weight cache · K=2048 (TinyLlama 1.1B)

✅ Done

First working XDNA2 backend for llama.cpp. Implements GGML_OP_MUL_MAT dispatch via XRT to a compiled mlir-aie xclbin. Weights quantised to int8 with per-row scales and cached after first use. All non-MUL_MAT ops fall back to CPU.
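A minimal numpy sketch of that weight path, assuming symmetric quantisation to the int8 range with one scale per row (the actual backend is C++ and may differ in details):

```python
import numpy as np

def quantise_per_row(w: np.ndarray):
    """Symmetric int8 quantisation with one scale per weight row; in the
    backend the result is computed once and cached after first use."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0.0] = 1.0  # guard all-zero rows against divide-by-zero
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

w = np.random.default_rng(0).standard_normal((4, 2048)).astype(np.float32)
q, s = quantise_per_row(w)
recon = q.astype(np.float32) * s  # dequantise to check round-trip error
print(q.dtype, float(np.abs(recon - w).max()) <= float(s.max()) / 2 + 1e-6)
```

Per-row scales keep a single outlier in one row from crushing the precision of every other row, which matters for the accuracy of the int8 matmul.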

Validated on TinyLlama 1.1B with K=2048 attention layers generating correct output.

| Metric | Phase 1 (baseline) | Best NPU-only | Phase 7 (current) |
|---|---|---|---|
| Prefill pp512 | ~4.6 t/s (CPU) | 13.7 t/s (NPU, Phase 6) | 930 t/s (Vulkan) |
| Decode tg64 | ~4.4 t/s (CPU) | ~4.1 t/s (CPU, NPU idle) | 42 t/s (NPU) |