Phase 1 — NPU Baseline

First working XDNA2 backend for llama.cpp

Wins
XDNA2 backend operational
NPU faster than CPU for prefill
Weight cache eliminates re-quantisation
TinyLlama 1.1B validated

What was built

A custom ggml backend that intercepts GGML_OP_MUL_MAT and dispatches it to the AMD XDNA2 NPU via XRT kernel dispatch. All other operations fall back automatically to the CPU backend, because supports_op() returns false for them.
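
The op-filtering logic can be sketched as below. This is a simplified stand-in, not the backend's actual code: the struct and enum mirror only the ggml fields the check needs, and the K=2048 constraint comes from the coverage note later in this changelog.

```cpp
#include <cstdint>
#include <cassert>

// Hypothetical mirror of the relevant ggml enums/structs; the real backend
// works against ggml's own ggml_tensor and GGML_OP_MUL_MAT.
enum ggml_op { OP_MUL_MAT, OP_ADD, OP_SOFT_MAX };

struct tensor {
    ggml_op op;
    int64_t ne[2];   // ne[0] = K (shared inner dimension), ne[1] = rows
};

// Claim only MUL_MAT with the K the compiled xclbin supports; returning
// false for everything else makes ggml route the op to the CPU backend.
static bool xdna_supports_op(const tensor *src0, ggml_op op) {
    const int64_t SUPPORTED_K = 2048;   // fixed by the Phase-1 kernel
    if (op != OP_MUL_MAT) return false;
    return src0->ne[0] == SUPPORTED_K;
}
```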

The backend implements the full ggml backend interface: buffer allocation, tensor quantisation, and asynchronous kernel dispatch with a fixed tile size (TILE_M × TILE_K × TILE_N). Larger matrices are tiled at the host level and dispatched as sequential tile calls.
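
The host-level tiling loop looks roughly like the sketch below. The tile sizes here are small illustrative values, not the shape baked into the xclbin, and dispatch_tile is a CPU stand-in for one NPU kernel invocation; it assumes the matrix dimensions are exact tile multiples.

```cpp
#include <cstdint>
#include <vector>
#include <cassert>

// Hypothetical tile shape; the real values are compiled into the xclbin.
constexpr int TILE_M = 4, TILE_K = 8, TILE_N = 4;

// Stand-in for one NPU kernel call: multiply a TILE_M x TILE_K tile of A
// by a TILE_K x TILE_N tile of B, accumulating into the C tile.
static void dispatch_tile(const int8_t *A, const int8_t *B, int32_t *C,
                          int lda, int ldb, int ldc) {
    for (int m = 0; m < TILE_M; ++m)
        for (int n = 0; n < TILE_N; ++n) {
            int32_t acc = 0;
            for (int k = 0; k < TILE_K; ++k)
                acc += (int32_t)A[m * lda + k] * (int32_t)B[k * ldb + n];
            C[m * ldc + n] += acc;
        }
}

// Host-level tiling: walk the output in TILE_M x TILE_N blocks and issue
// one sequential tile call per K-slice. Assumes M, K, N are tile multiples.
static void tiled_mul_mat(const int8_t *A, const int8_t *B, int32_t *C,
                          int M, int K, int N) {
    for (int m = 0; m < M; m += TILE_M)
        for (int n = 0; n < N; n += TILE_N)
            for (int k = 0; k < K; k += TILE_K)
                dispatch_tile(A + m * K + k, B + k * N + n,
                              C + m * N + n, K, N, N);
}
```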

Weight cache: model weights (static per layer) are quantised to int8 with per-row scales on first use and stored in a pinned XRT buffer. Subsequent tokens reuse the cached quantised weights — eliminating the dominant re-quantisation overhead.
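
The caching pattern can be sketched as follows. This is a simplified illustration under stated assumptions: the entry struct, function name, and keying by the weight tensor's data pointer are hypothetical, and in the real backend the int8 data lives in a pinned XRT buffer object rather than a std::vector.

```cpp
#include <cstdint>
#include <cmath>
#include <unordered_map>
#include <vector>

// Hypothetical cache entry; the backend stores the int8 data in a pinned
// XRT buffer object instead of host vectors.
struct cached_weight {
    std::vector<int8_t> q;       // per-row int8 weights
    std::vector<float>  scales;  // one float32 scale per row
};

// Keyed by the weight tensor's data pointer: weights are static per layer,
// so the same pointer always maps to the same quantised buffer.
static std::unordered_map<const float *, cached_weight> g_weight_cache;

static const cached_weight &get_or_quantise(const float *w, int rows, int cols) {
    auto it = g_weight_cache.find(w);
    if (it != g_weight_cache.end()) return it->second;  // hit: no re-quantisation

    cached_weight cw;
    cw.q.resize((size_t)rows * cols);
    cw.scales.resize(rows);
    for (int r = 0; r < rows; ++r) {
        float amax = 0.0f;
        for (int c = 0; c < cols; ++c)
            amax = std::fmax(amax, std::fabs(w[r * cols + c]));
        const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        cw.scales[r] = scale;
        for (int c = 0; c < cols; ++c)
            cw.q[r * cols + c] = (int8_t)std::lround(w[r * cols + c] / scale);
    }
    return g_weight_cache.emplace(w, std::move(cw)).first->second;
}
```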

Activation quantisation: activations change every token, so they are quantised per call — int8 per row, with one float32 scale per row.
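
A minimal sketch of the per-row scheme, in the spirit of the helpers in ggml-xdna-quant.h (the actual function names there are not shown in this changelog, so this one is hypothetical):

```cpp
#include <cstdint>
#include <cmath>

// Per-row int8 quantisation: one float32 scale = absmax / 127 per row, so
// each row's int8 values span the full [-127, 127] range independently.
static float quantise_row_int8(const float *x, int8_t *q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int i = 0; i < n; ++i) q[i] = (int8_t)std::lround(x[i] / scale);
    return scale;  // stored per row, applied after the int8 matmul
}
```

With round-to-nearest, the reconstruction q[i] * scale is within scale/2 of the original value, which is why one scale per row is enough for activations.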

Architecture

Kernel: built from the mlir-aie matrix_multiplication single-core example. The compiled xclbin encodes a fixed TILE_M × TILE_K × TILE_N shape.

XRT dispatch: one xrt::hw_context per xclbin slot. Activations and weights are DMA'd into XRT buffer objects (bo_a, bo_b) and the kernel writes to bo_c.
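
A hedged sketch of one tile dispatch, following the host-code pattern from the mlir-aie examples. The xclbin filename, the kernel name "MLIR_AIE", the opcode value, the instruction-stream argument, and the group_id indices are assumptions carried over from that example code, not confirmed details of this backend — and this only runs on an XDNA NPU with the XRT runtime installed.

```cpp
#include <xrt/xrt_device.h>
#include <xrt/xrt_bo.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_hw_context.h>   // may live under experimental/ in older XRT

#include <cstdint>
#include <vector>

// One tile call: DMA A and B to the device, run the kernel, read C back.
// Buffer sizes follow the fixed tile shape compiled into the xclbin.
void run_tile(const std::vector<int8_t> &a, const std::vector<int8_t> &b,
              std::vector<int32_t> &c, const std::vector<uint32_t> &instr) {
    xrt::device device(0);
    xrt::xclbin xclbin("mm_tile.xclbin");      // hypothetical filename
    device.register_xclbin(xclbin);

    // One hw_context per xclbin slot, as described above.
    xrt::hw_context ctx(device, xclbin.get_uuid());
    xrt::kernel kernel(ctx, "MLIR_AIE");       // default mlir-aie kernel name

    // Instruction stream plus data buffers; group_id() maps kernel
    // arguments to memory banks.
    xrt::bo bo_instr(device, instr.size() * sizeof(uint32_t),
                     XCL_BO_FLAGS_CACHEABLE, kernel.group_id(1));
    xrt::bo bo_a(device, a.size(), XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
    xrt::bo bo_b(device, b.size(), XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4));
    xrt::bo bo_c(device, c.size() * sizeof(int32_t),
                 XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(5));

    bo_instr.write(instr.data());
    bo_a.write(a.data());
    bo_b.write(b.data());
    bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Opcode 3 = "run instruction stream" in the mlir-aie host examples.
    auto run = kernel(3, bo_instr, instr.size(), bo_a, bo_b, bo_c);
    run.wait();

    bo_c.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    bo_c.read(c.data());
}
```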

K=2048 coverage: attention projection layers in TinyLlama 1.1B. FFN down layers (K=5632) fall back to CPU in this phase.

Performance (TinyLlama 1.1B)

Backend                 Prefill (pp512)   Decode
CPU only                ~15 t/s           ~12 t/s
NPU 1-slot (Phase 1)    ~22 t/s           ~12 t/s

Decode is unchanged: the NPU covers MUL_MAT during prefill only. Decode (M=1) uses the same xclbin path, but per-tile dispatch overhead dominates at batch size 1.
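
The arithmetic behind that: a decode step has M=1, but the kernel always computes a full TILE_M rows, so most of each tile call is wasted. A small worked example — TILE_M = 32 here is a hypothetical value, not the shape actually baked into the Phase-1 xclbin:

```cpp
// Fraction of the tile's M dimension doing useful work for batch size M,
// assuming M is padded up to the next multiple of TILE_M.
constexpr int TILE_M = 32;   // hypothetical; real value is in the xclbin

constexpr double utilisation(int M) {
    return (double)M / (double)(((M + TILE_M - 1) / TILE_M) * TILE_M);
}
```

Under this assumption, decode (M=1) uses about 3% of each tile's compute (1/32), while prefill at pp512 uses 100% — which is why prefill speeds up and decode does not.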

Key files introduced

ggml/src/ggml-xdna/ggml-xdna.cpp — full backend implementation

ggml/src/ggml-xdna/ggml-xdna-quant.h — int8 quantisation helpers

ggml/src/ggml-xdna/CMakeLists.txt — build integration