AMD XDNA2 NPU Backend — Changelogs

Phase-by-phase progress on the custom llama.cpp XDNA2 backend. · GitHub

Ryzen AI MAX 385 · RyzenAI-npu5 (XDNA2)

Phase 8 — Performance Ceiling Investigation

Batch sweep · int4 · cascade · speculative decoding · 43.7 t/s ceiling confirmed · 2026-03-26

Decode +11.2× (42 t/s NPU) · Prefill +30× (930 t/s Vulkan) · 2026-03-26

pp2048: 19.5 t/s (+51% vs Phase 5) · all 4 K-slots on 4-col xclbins

8k context validated · NPU 2–3× over CPU · attention fallback characterised

NPU on dedicated XDNA2 silicon · bench-power.sh · J/tok baseline

K=4096 and K=14336 · Llama 3.1 8B fully covered · tile-loop optimisation

K=5632 FFN layers on NPU · multi-slot architecture (up to 8 slots)

First working XDNA2 backend · weight cache · K=2048 · TinyLlama 1.1B