4× AIE column parallelism · TILE_N=256 · all 4 K-slots upgraded
The 4-col NPU draws 58.5 W avg (vs 45.8 W in Phase 5), with all 4 AIE columns active during prefill. Decode remains CPU-only, so the NPU idles at elevated power during generation.
Rebuilt all 4 xclbin slots (K=2048, 4096, 5632, 14336) with n_aie_cols=4, dispatching each matrix tile across 4 AIE columns simultaneously instead of 1. Each column handles a quarter of the output tile in parallel.
The mlir-aie build parameter n_aie_cols=4 changes the generated kernel to use a 4-column AIE array partition. The host-side tile loop is unchanged — the same TILE_M × TILE_K × TILE_N dispatch produces 4× more compute per call.
TILE_N was increased from 64 to 256 to better utilise the 4-column output width. Each dispatch now produces a 2048×TILE_N output slice.
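The column split described above can be sketched as follows. This is illustrative only: the function and constant names are hypothetical, not the mlir-aie or host-runtime API.

```python
# Sketch (hypothetical names): with n_aie_cols=4 and TILE_N=256, each AIE
# column computes a contiguous 64-wide quarter of the output tile.
TILE_N = 256
N_AIE_COLS = 4

def column_slices(tile_n: int, n_cols: int):
    """Partition one output tile of width tile_n into per-column
    [start, end) sub-ranges, one per AIE column."""
    width = tile_n // n_cols
    return [(c * width, (c + 1) * width) for c in range(n_cols)]

# Each AIE column handles one quarter of the 256-wide output tile.
assert column_slices(TILE_N, N_AIE_COLS) == [
    (0, 64), (64, 128), (128, 192), (192, 256)
]
```

The host still issues one dispatch per tile; the per-column split happens inside the 4-col xclbin, which is why the host-side loop is unchanged.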
The NPU5 (Strix Halo) has 16 AIE columns and a driver-enforced hwctx_limit=16. Each 4-col xclbin occupies 4 columns, so a maximum of 4 simultaneous hw_contexts can be loaded (4 × 4 = 16 columns).
Attempting to load 8 slots (8 × 4-col = 32 columns) exceeded the physical column count — all NPU dispatches silently fell through to CPU. This set the 4-slot ceiling for all subsequent phases.
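The column-budget arithmetic behind the 4-slot ceiling can be sketched numerically (an illustrative helper, not driver code):

```python
# NPU5 (Strix Halo) figures from the text above.
TOTAL_COLS = 16      # physical AIE columns == driver hwctx_limit
COLS_PER_XCLBIN = 4  # each 4-col xclbin occupies 4 columns

def max_slots(total_cols: int, cols_per_slot: int) -> int:
    """Maximum simultaneous hw_contexts that fit in the column budget."""
    return total_cols // cols_per_slot

assert max_slots(TOTAL_COLS, COLS_PER_XCLBIN) == 4  # 4 slots: 4 * 4 = 16
# 8 slots would need 8 * 4 = 32 columns > 16, so the loads fail and every
# dispatch silently falls back to CPU, as observed.
assert 8 * COLS_PER_XCLBIN > TOTAL_COLS
```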
| Test | Phase 5 (1-col) | Phase 6 (4-col) | Gain |
|---|---|---|---|
| pp=512 | 10.2 t/s | 13.7 t/s | +34% |
| pp=2048 | 12.9 t/s | 19.5 t/s | +51% |
| pp=4096 | 11.7 t/s | 16.2 t/s | +38% |
| pp=8192 | 8.9 t/s | 10.9 t/s | +22% |
| Decode tg20 | ~4.1 t/s (CPU) | ~3.76 t/s (CPU) | — |
Peak gain lands at pp=2048, where the wider TILE_N=256 output tile is fully saturated; longer contexts see smaller gains as CPU-side attention dominates.
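For reference, the Gain column is just the Phase 6 / Phase 5 throughput ratio, rounded to a whole percent:

```python
def gain_pct(old_tps: float, new_tps: float) -> int:
    """Percentage speedup of new throughput over old."""
    return round((new_tps / old_tps - 1) * 100)

# Reproduces the table rows above.
assert gain_pct(10.2, 13.7) == 34  # pp=512
assert gain_pct(12.9, 19.5) == 51  # pp=2048
assert gain_pct(11.7, 16.2) == 38  # pp=4096
assert gain_pct(8.9, 10.9) == 22   # pp=8192
```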
GGML_XDNA_MIN_N=2 — excludes single-token decode (M=1) from NPU to avoid overhead at batch size 1.
GGML_XDNA_TILE_N=256 (slots 1–4) — wider output tile for better 4-col utilisation.
All 4 xclbins point at ~/xclbin-4col/ (TILE_N=256, n_aie_cols=4 builds).
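A minimal sketch of the gating that GGML_XDNA_MIN_N implies, assuming the backend compares the batch dimension M against the threshold (the `use_npu` helper is hypothetical, not the actual ggml backend code):

```python
import os

# Values from the config above; the gating logic itself is illustrative.
os.environ["GGML_XDNA_MIN_N"] = "2"
os.environ["GGML_XDNA_TILE_N"] = "256"

def use_npu(m: int) -> bool:
    """Offload a matmul only when batch dimension M meets the configured
    minimum; single-token decode (M=1) stays on CPU."""
    return m >= int(os.environ.get("GGML_XDNA_MIN_N", "1"))

assert not use_npu(1)  # decode, M=1 -> CPU
assert use_npu(512)    # prefill batch -> NPU
```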