First working XDNA2 backend for llama.cpp
A custom ggml backend that intercepts GGML_OP_MUL_MAT and dispatches it to the AMD XDNA2 NPU via XRT kernel dispatch. All other operations fall back to the CPU backend automatically, because supports_op() returns false for them.
The backend implements the full ggml backend interface: buffer allocation, tensor quantisation, and asynchronous kernel dispatch with a fixed tile size (TILE_M × TILE_K × TILE_N). Larger matrices are tiled at the host level and dispatched as sequential tile calls.
Weight cache: model weights (static per layer) are quantised to int8 with per-row scales on first use and stored in a pinned XRT buffer. Subsequent tokens reuse the cached quantised weights, eliminating the dominant re-quantisation overhead.
Activation quantisation: activations change every token, so they are quantised per call to int8 with one float32 scale per row.
Kernel: built from the mlir-aie matrix_multiplication single-core example. The compiled xclbin encodes a fixed TILE_M × TILE_K × TILE_N shape.
XRT dispatch: one xrt::hw_context per xclbin slot. Activations and weights are DMA'd into XRT buffer objects (bo_a, bo_b) and the kernel writes to bo_c.
K=2048 coverage: the attention projection layers in TinyLlama 1.1B run on the NPU. FFN down layers (K=5632) fall back to the CPU in this phase.
| Backend | Prefill pp512 | Decode |
|---|---|---|
| CPU only | ~15 t/s | ~12 t/s |
| NPU 1-slot (Phase 1) | ~22 t/s | ~12 t/s |
Decode is unchanged because the NPU only helps prefill MUL_MAT: decode (M=1) uses the same xclbin path, but the per-tile overhead dominates at batch size 1.
- `ggml/src/ggml-xdna/ggml-xdna.cpp` — full backend implementation
- `ggml/src/ggml-xdna/ggml-xdna-quant.h` — int8 quantisation helpers
- `ggml/src/ggml-xdna/CMakeLists.txt` — build integration