AMD XDNA2 NPU Backend — Changelogs

Phase-by-phase progress on the custom llama.cpp XDNA2 backend.

Ryzen AI MAX 385  ·  RyzenAI-npu5 (XDNA2)

Phase 8 — Performance Ceiling Investigation

Batch sweep · int4 · cascade · speculative decoding · 43.7 t/s ceiling confirmed · 2026-03-26

Phase 7 — NPU Decode Acceleration + Vulkan Prefill

Decode +11.2× (42 t/s NPU) · Prefill +30× (930 t/s Vulkan) · 2026-03-26

Phase 6 — Multi-Core NPU (4-Column, TILE_N=256)

pp2048: 19.5 t/s (+51% vs Phase 5) · all 4 K-slots on 4-col xclbins

Phase 5 — Long Context & 1-Column NPU

8k context validated · NPU 2–3× over CPU · attention fallback characterised

Phase 4 — Power Measurement & Workload Isolation

NPU on dedicated XDNA2 silicon · bench-power.sh · J/tok baseline

Phase 3 — 8B Model Support (4 K-Slots)

K=4096 and K=14336 · Llama 3.1 8B fully covered · tile-loop optimisation

Phase 2 — Dual-Slot Dispatch

K=5632 FFN layers on NPU · multi-slot architecture (up to 8 slots)

Phase 1 — NPU Baseline

First working XDNA2 backend · weight cache · K=2048 · TinyLlama 1.1B