Phase 6 — Multi-Core NPU (4-Column)

4× AIE column parallelism · TILE_N=256 · all 4 K-slots upgraded

Wins
pp2048: 19.5 t/s (+51% vs Phase 5)
pp512: 13.7 t/s (+34% vs Phase 5)
pp4096: 16.2 t/s (+38% vs Phase 5)
All 4 K-slots on 4-col xclbins

Power trade-off

4-col NPU draws 58.5 W avg (vs 45.8 W Phase 5) — all 4 AIE columns active during prefill. Decode remains CPU-only, so NPU idles at elevated power during generation.
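To put the power increase next to the throughput increase, the figures quoted in this changelog can be turned into a rough tokens-per-joule comparison. This is only a back-of-the-envelope sketch: it assumes the average power figures apply to the pp2048 prefill run, which the changelog does not state, and the names `phase5`, `phase6` and `tokens_per_joule` are made up for illustration.

```python
# Figures quoted in this changelog. Pairing the average power with the
# pp2048 throughput is an ASSUMPTION, not something the source confirms.
phase5 = {"watts": 45.8, "pp2048_tps": 12.9}
phase6 = {"watts": 58.5, "pp2048_tps": 19.5}


def tokens_per_joule(run):
    """Tokens/s divided by watts (joules/s) gives tokens per joule."""
    return run["pp2048_tps"] / run["watts"]


power_increase_pct = (phase6["watts"] / phase5["watts"] - 1) * 100
efficiency_gain_pct = (tokens_per_joule(phase6) / tokens_per_joule(phase5) - 1) * 100

print(f"power increase:       {power_increase_pct:+.0f}%")
print(f"tokens/joule change:  {efficiency_gain_pct:+.0f}%")
```

Under that assumption, prefill throughput grows faster than power draw. The comparison says nothing about decode, where the NPU idles at elevated power.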

What was built

Rebuilt all 4 xclbin slots (K=2048, 4096, 5632, 14336) with n_aie_cols=4, dispatching each matrix tile across 4 AIE columns simultaneously instead of 1. Each column handles a quarter of the output tile in parallel.

The mlir-aie build parameter n_aie_cols=4 changes the generated kernel to use a 4-column AIE array partition. The host-side tile loop is unchanged — the same TILE_M × TILE_K × TILE_N dispatch produces 4× more compute per call.

TILE_N was increased to 256 (from 64) to better utilise the wider 4-column output width. Each dispatch now produces a 2048×TILE_N output slice.
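The effect of the wider tile can be sketched with a little arithmetic. The snippet below is an illustration, not the host-side code: it assumes the output tile is split evenly along N across the AIE columns (the changelog only says each column handles a quarter of the tile), and the output width of 4096 is a hypothetical value chosen for the example.

```python
TILE_N = 256      # Phase 6 tile width from this changelog (was 64)
N_AIE_COLS = 4    # n_aie_cols build parameter


def column_slices(tile_n=TILE_N, n_cols=N_AIE_COLS):
    """Split one output tile's N range evenly across AIE columns.

    The even split along N is an assumption made for illustration.
    """
    if tile_n % n_cols != 0:
        raise ValueError("tile_n must be divisible by n_cols")
    width = tile_n // n_cols
    return [(c * width, (c + 1) * width) for c in range(n_cols)]


def dispatches_along_n(n, tile_n):
    """Number of tile dispatches needed to cover an output width of n."""
    return -(-n // tile_n)  # ceiling division


print(column_slices())
# Hypothetical output width of 4096, old vs new tile width:
print(dispatches_along_n(4096, 64), dispatches_along_n(4096, 256))
```

With these numbers each AIE column covers 64 output columns, the same width a whole Phase 5 tile covered, and the host loop issues a quarter as many dispatches along N.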

Column resource constraint (discovered)

The NPU5 (Strix Halo) has 16 AIE columns and a driver-enforced hwctx_limit=16. Each 4-col xclbin occupies 4 columns, so a maximum of 4 simultaneous hw_contexts can be loaded (4 × 4 = 16 columns).

Attempting to load 8 slots (8 × 4-col = 32 columns) exceeded the physical column count — all NPU dispatches silently fell through to CPU. This set the 4-slot ceiling for all subsequent phases.
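The column budget above reduces to a short calculation. The sketch below uses the numbers stated in this section. The function names are hypothetical and it does not query the driver, so treat it as a sanity check before building more slots rather than as a description of how the driver allocates columns.

```python
TOTAL_AIE_COLUMNS = 16   # as stated in this changelog
HWCTX_LIMIT = 16         # driver-enforced hwctx_limit, as stated
COLS_PER_XCLBIN = 4      # n_aie_cols=4 builds


def max_resident_slots(total_cols=TOTAL_AIE_COLUMNS,
                       cols_per_xclbin=COLS_PER_XCLBIN,
                       hwctx_limit=HWCTX_LIMIT):
    """Largest number of xclbins that can be loaded at the same time."""
    return min(total_cols // cols_per_xclbin, hwctx_limit)


def fits(n_slots, total_cols=TOTAL_AIE_COLUMNS,
         cols_per_xclbin=COLS_PER_XCLBIN, hwctx_limit=HWCTX_LIMIT):
    """True if n_slots xclbins fit in the column and context budgets."""
    return n_slots * cols_per_xclbin <= total_cols and n_slots <= hwctx_limit


print(max_resident_slots())   # 4-col builds
print(fits(4), fits(8))       # 4 slots fit, 8 slots need 32 columns
```

Because the failure mode was a silent CPU fallback, a check like this at startup is one way to catch an oversubscribed configuration early.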

Performance vs Phase 5 (Llama 3.1 8B Q4_K_M)

Test          Phase 5 (1-col)    Phase 6 (4-col)    Gain
pp=512        10.2 t/s           13.7 t/s           +34%
pp=2048       12.9 t/s           19.5 t/s           +51%
pp=4096       11.7 t/s           16.2 t/s           +38%
pp=8192       8.9 t/s            10.9 t/s           +22%
Decode tg20   ~4.1 t/s (CPU)     ~3.76 t/s (CPU)

The gain peaks at pp=2048, where the wider TILE_N=256 is fully saturated. Longer contexts see smaller gains as CPU attention dominates.
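The gain column can be recomputed from the throughput figures as a consistency check. The sketch below uses only the numbers in the table. The decode values are approximate in the source, so the decode result is indicative at best.

```python
# (Phase 5 t/s, Phase 6 t/s) taken from the table above.
results = {
    "pp=512": (10.2, 13.7),
    "pp=2048": (12.9, 19.5),
    "pp=4096": (11.7, 16.2),
    "pp=8192": (8.9, 10.9),
    "decode tg20": (4.1, 3.76),  # approximate values in the source
}


def gain_pct(before, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round((after / before - 1) * 100)


gains = {test: gain_pct(p5, p6) for test, (p5, p6) in results.items()}
for test, gain in gains.items():
    print(f"{test:12s} {gain:+d}%")
```

The prefill rows reproduce the table's gain column. The decode row, which the table leaves without a gain figure, comes out negative.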

.zshrc configuration (Phase 6)

GGML_XDNA_MIN_N=2 — excludes single-token decode (M=1) from NPU to avoid overhead at batch size 1.

GGML_XDNA_TILE_N=256 (slots 1–4) — wider output tile for better 4-col utilisation.

All 4 xclbins point at ~/xclbin-4col/ (TILE_N=256, n_aie_cols=4 builds).
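The threshold behaviour of GGML_XDNA_MIN_N can be illustrated with a small sketch. This is not the backend's code: the helper names are hypothetical, and it assumes the variable is compared against the number of tokens in the batch with a simple "at least" test, which the changelog implies but does not spell out.

```python
import os


def read_int_env(name, default):
    """Read an integer environment variable, falling back to a default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default


def use_npu(n_tokens, min_n):
    """ASSUMED rule: offload to the NPU only when the batch has at least min_n tokens."""
    return n_tokens >= min_n


# Defaults mirror the Phase 6 values listed above.
min_n = read_int_env("GGML_XDNA_MIN_N", 2)
tile_n = read_int_env("GGML_XDNA_TILE_N", 256)

for n_tokens in (1, 2, 512):
    target = "NPU" if use_npu(n_tokens, min_n) else "CPU"
    print(f"batch of {n_tokens:4d} tokens -> {target} (TILE_N={tile_n})")
```

With a threshold of 2, single-token decode stays on the CPU while any prefill batch is eligible for the NPU.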