# Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch
## Phase 7: TILE_N=64 decode xclbins · MIN_N=1 MAX_N=1 routing · 2026-03-26
| Test | Phase 6 | Phase 7 | Backend | Change |
|---|---|---|---|---|
| Prefill pp512 | 30.91 t/s | 930.92 t/s | Vulkan | +30× |
| Decode tg64 | 3.76 t/s | 43.84 t/s | NPU | +11.7× |

| Backend | Decode | Avg Power | J/tok |
|---|---|---|---|
| Phase 6 (CPU decode) | 3.76 t/s | 58.5 W | 16.2 J/tok |
| Vulkan | 41.6 t/s | 52.2 W | 1.3 J/tok |
| Phase 7 NPU decode | 43.84 t/s | 41.5 W | 0.947 J/tok |
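The J/tok column is just average power divided by decode throughput (W = J/s, so W ÷ tok/s = J/tok); e.g. the Phase 7 row gives 41.5 W / 43.84 t/s ≈ 0.947 J/tok:

```cpp
#include <cassert>

// Energy per decoded token = average SoC power (W = J/s) / decode rate (tok/s).
double joules_per_token(double avg_power_w, double decode_tok_per_s) {
    return avg_power_w / decode_tok_per_s;
}
```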
Built 4 decode xclbins with TILE_N=64 (minimum n=16 per AIE hardware constraint — n=8 rejected by kernel). Swapped all 4 .zshrc slots from ~/xclbin-4col/ to ~/xclbin-decode/.
Added MIN_N=1, MAX_N=1 to restrict NPU to single-token decode. Vulkan picks up all prefill ops (N>1) automatically via llama.cpp backend scheduling.
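As a sketch (identifiers hypothetical, not the backend's actual symbols), the decode-only gate reduces to a range check on n in the backend's op-support hook:

```cpp
#include <cassert>

// Hypothetical sketch of the decode-only routing gate. With MIN_N = MAX_N = 1
// the NPU backend only claims single-token (decode) MUL_MATs; anything with
// n > 1 (prefill) is left for Vulkan to pick up via llama.cpp's scheduler.
constexpr int MIN_N = 1;
constexpr int MAX_N = 1;

bool npu_claims_mul_mat(int n) {
    return n >= MIN_N && n <= MAX_N;
}
```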
Confirmed the NPU handles 100% of decode: with GGML_VK_DISABLE=1, decode stays at 42.90 t/s (within run-to-run variance of the 43.84 t/s above), showing Vulkan contributes nothing to decode throughput.
NPU decode (43.84 t/s) is on par with Vulkan decode (41.6 t/s); the advantage is that the dedicated XDNA2 silicon leaves the iGPU free for other workloads.
XDNA2 hw_context limit clarified: driver source (npu5_regs.c) shows hwctx_limit=16, not 4. Previous 8-slot failure was column exhaustion (8×4-col = 32 cols > 16 available), not a firmware context cap.
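The column-exhaustion arithmetic is plain: with 16 columns on the array, 8 concurrent 4-column xclbins cannot fit, while 4 of them exactly fill it:

```cpp
#include <cassert>

// XDNA2 exposes 16 AIE columns; per npu5_regs.c the hw-context limit is also
// 16, so the column budget (not contexts) is what capped the 8-slot attempt.
constexpr int TOTAL_AIE_COLS = 16;

bool slots_fit(int n_slots, int cols_per_slot) {
    return n_slots * cols_per_slot <= TOTAL_AIE_COLS;
}
```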
## Phase 6: 4× AIE column parallelism · TILE_N=256 prefill xclbins · all 4 K-slots
| Test | Phase 5 | Phase 6 | Change |
|---|---|---|---|
| Prefill pp512 | 10.2 t/s | 13.7 t/s | +34% |
| Prefill pp2048 | 12.9 t/s | 19.5 t/s | +51% |
| Prefill pp4096 | 11.7 t/s | 16.2 t/s | +38% |
| Prefill pp8192 | 8.9 t/s | 10.9 t/s | +22% |
Rebuilt all 4 K-slot xclbins (K=2048, 4096, 5632, 14336) with n_aie_cols=4, upgrading from 1-column to 4-column dispatch. Peak throughput at pp=2048 where tile utilisation is highest.
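One way the 4-column dispatch might divide work (a sketch under the assumption that output rows are split evenly across columns; the real tiling lives in the mlir-aie design):

```cpp
#include <utility>
#include <vector>

// Split m output rows into n_cols contiguous chunks, one per AIE column.
// Returns (start_row, row_count) pairs; earlier columns absorb any remainder.
std::vector<std::pair<int, int>> split_rows(int m, int n_cols) {
    std::vector<std::pair<int, int>> chunks;
    int base = m / n_cols, rem = m % n_cols, row = 0;
    for (int c = 0; c < n_cols; ++c) {
        int len = base + (c < rem ? 1 : 0);
        chunks.emplace_back(row, len);
        row += len;
    }
    return chunks;
}
```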
Power draw increased to 58.5 W average (vs 45.8 W in Phase 5), with all 4 AIE columns active during prefill. Decode remains CPU-only, so the NPU sits idle but still draws power during generation.
## Phase 5: 8k context validated · attention fallback characterised · bench-context.sh tooling
Validated 8k context on 30 GiB RAM (KV cache ≈ 1 GiB, model ≈ 4.6 GiB). NPU throughput peaks at pp=2048 (full tile utilisation) and degrades at longer context because attention score matmuls (K=seq_len, variable K) fall back to CPU; only fixed-K projection matmuls are offloaded.
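The ≈1 GiB KV-cache figure checks out against Llama 3.1 8B's GQA geometry (32 layers, 8 KV heads, head_dim 128, f16 elements):

```cpp
#include <cassert>
#include <cstdint>

// KV cache size = 2 (K and V) x layers x kv_heads x head_dim x 2 bytes (f16)
// x context length. Geometry below is Llama 3.1 8B's GQA configuration.
std::int64_t kv_cache_bytes(std::int64_t n_ctx) {
    const std::int64_t n_layers = 32, n_kv_heads = 8, head_dim = 128, f16 = 2;
    return 2 * n_layers * n_kv_heads * head_dim * f16 * n_ctx;
}
```

At n_ctx = 8192 this lands on exactly 2^30 bytes, i.e. 1 GiB.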
Added bench-context.sh tooling for sweep benchmarks across context lengths.
## Phase 4: NPU vs Vulkan power characterisation · bench-power.sh
| Backend | Prefill | Decode | Avg Power | J/tok (decode) |
|---|---|---|---|---|
| NPU 4-col (Phase 6) | 32.7 t/s | 3.6 t/s | 58.5 W | 16.2 J/tok |
| Vulkan iGPU | 632 t/s | 41.6 t/s | 52.2 W | 1.3 J/tok |
Confirmed the NPU runs on dedicated XDNA2 silicon with no GPU contention, so it is the preferred backend when the iGPU is busy with other workloads. Added bench-power.sh for SoC PPT measurement.
## Phase 3: K=4096 and K=14336 slots · tile-loop optimisation
Added K=4096 (attention + FFN gate/up) and K=14336 (FFN down) xclbin slots. Tile-loop optimised to reduce host overhead. All 4 dominant K dimensions in Llama 3.1 8B now offloaded to NPU.
## Phase 2: second xclbin slot for K=5632 FFN down layers
Extended the backend to support a second simultaneously-loaded xclbin covering K=5632 (FFN down layers in TinyLlama). The slot system was generalised to support up to 8 slots with auto-selection by K dimension.
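A sketch of slot auto-selection by K (identifiers hypothetical; the real slots would hold XRT kernel handles alongside the K dimension):

```cpp
#include <cassert>

// Up to 8 xclbin slots, each serving one fixed K dimension; a MUL_MAT routes
// to the first slot whose K matches, or falls back to CPU (-1) otherwise.
constexpr int MAX_SLOTS = 8;
int slot_k[MAX_SLOTS] = { 2048, 5632, 0, 0, 0, 0, 0, 0 };  // TinyLlama-era setup

int slot_for_k(int k) {
    for (int i = 0; i < MAX_SLOTS; ++i)
        if (slot_k[i] == k) return i;
    return -1;  // no matching xclbin loaded: CPU fallback
}
```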
## Phase 1: XDNA2 ggml backend · weight cache · K=2048 (TinyLlama 1.1B)
First working XDNA2 backend for llama.cpp. Implements GGML_OP_MUL_MAT dispatch via XRT to a compiled mlir-aie xclbin. Weights quantised to int8 with per-row scales and cached after first use. All non-MUL_MAT ops fall back to CPU.
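A minimal sketch of per-row symmetric int8 quantisation with one scale per row (illustrative, not the backend's exact code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// One scale per weight row: scale = max(|w|) / 127, q = round(w / scale).
// Dequantisation on the accumulator side is q * scale.
struct QuantRow {
    std::vector<std::int8_t> q;
    float scale;
};

QuantRow quantize_row_int8(const std::vector<float>& w) {
    float amax = 0.0f;
    for (float x : w) amax = std::max(amax, std::fabs(x));
    QuantRow r;
    r.scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    r.q.reserve(w.size());
    for (float x : w)
        r.q.push_back(static_cast<std::int8_t>(std::lround(x / r.scale)));
    return r;
}
```

Caching the quantised rows after first use is what makes repeated dispatches cheap: the float-to-int8 conversion happens once per weight tensor, not once per token.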
Validated on TinyLlama 1.1B with K=2048 attention layers generating correct output.
## Summary

| Metric | Phase 1 (baseline) | Best NPU-only | Phase 7 (current) |
|---|---|---|---|
| Prefill pp512 | ~4.6 t/s (CPU) | 13.7 t/s (NPU Phase 6) | 930 t/s (Vulkan) |
| Decode tg64 | ~4.4 t/s (CPU) | ~4.1 t/s (CPU, NPU idle) | 42 t/s (NPU) |