Phase-by-phase progress on the custom llama.cpp XDNA2 backend. · GitHub
Batch sweep · int4 · cascade · speculative decoding · 43.7 t/s ceiling confirmed · 2026-03-26
Decode +11.2× (42 t/s NPU) · Prefill +30× (930 t/s Vulkan) · 2026-03-26
pp2048: 19.5 t/s (+51% vs Phase 5) · all 4 K-slots on 4-col xclbins
8k context validated · NPU 2–3× over CPU · attention fallback characterised
NPU on dedicated XDNA2 silicon · bench-power.sh · J/tok baseline
K=4096 and K=14336 · Llama 3.1 8B fully covered · tile-loop optimisation
K=5632 FFN layers on NPU · multi-slot architecture (up to 8 slots)
First working XDNA2 backend · weight cache · K=2048 · TinyLlama 1.1B