Phase 5 — Long Context & 1-Column NPU

8k context validated · attention fallback characterised · bench-context.sh

Wins
- 8k context validated — no OOM
- NPU 2–3× over CPU at all context lengths
- Peak throughput at pp=2048 (full tile utilisation)
- Attention fallback behaviour fully characterised

What was built

Ran a full sweep of context lengths (pp=512 through pp=8192) to characterise NPU throughput degradation at long context. Added tools/bench-context.sh for automated context-length sweeps.
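A minimal sketch of the kind of sweep tools/bench-context.sh automates (the real script's flags and output format may differ; the llama-bench path and model file are placeholders, and the commands are printed rather than executed so the sweep can be inspected first):

```shell
# Hypothetical sketch of a context-length sweep. llama-bench's -m (model),
# -p (prompt tokens) and -n (generated tokens) flags are standard llama.cpp
# options; pipe the output to sh to actually run the benchmarks.
sweep() {
    model="$1"
    bench="${2:-./build/bin/llama-bench}"   # assumed build location
    for pp in 512 1024 2048 4096 8192; do
        # One benchmark invocation per prompt-processing length.
        echo "$bench -m $model -p $pp -n 0"
    done
}

sweep models/llama-3.1-8b-q4_k_m.gguf
```

Capturing one pp size per run keeps each data point isolated, so a mid-sweep OOM only loses that point rather than the whole sweep.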

Validated 8k context inference on 30 GiB RAM: KV cache ≈ 1 GiB + model ≈ 4.6 GiB leaves ample headroom. No OOM observed.
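The ≈1 GiB KV-cache figure can be sanity-checked from Llama 3.1 8B's published dimensions (32 layers, 8 KV heads, head dim 128) assuming an f16 cache:

```shell
# KV cache = 2 (K and V) x layers x KV heads x head dim x context x bytes/elt
n_layers=32; n_kv_heads=8; head_dim=128; ctx=8192; bytes_per_elt=2
kv_bytes=$(( 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt ))
echo "$(( kv_bytes / 1024 / 1024 )) MiB"   # prints "1024 MiB", i.e. 1 GiB
```

With ≈1 GiB of cache plus ≈4.6 GiB of weights, a 30 GiB machine has over 24 GiB of headroom, which matches the no-OOM result.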

Why NPU degrades at long context

The NPU only offloads MUL_MAT ops whose K dimension matches a kernel shape in a loaded xclbin. Attention score matmuls have K = seq_len, the current context length, which grows with every token, so they always fall back to the CPU.
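The offload decision described above can be sketched as a simple predicate. The K shapes listed here are illustrative placeholders, not the real xclbin kernel inventory:

```shell
# Hypothetical sketch: a MUL_MAT goes to the NPU only when its K dimension
# matches a fixed kernel shape compiled into a loaded xclbin. Shapes below
# are assumed for illustration (e.g. hidden and FFN dims of an 8B model).
XCLBIN_K_SHAPES="4096 14336"

offloads_to_npu() {   # usage: offloads_to_npu <K>
    for k in $XCLBIN_K_SHAPES; do
        if [ "$1" -eq "$k" ]; then return 0; fi
    done
    return 1          # no matching kernel: falls back to CPU
}

offloads_to_npu 4096 && echo "NPU"            # projection matmul, fixed K
offloads_to_npu 8192 || echo "CPU fallback"   # attention scores at 8k context
```

Because seq_len takes a different value at every decode step, no finite set of precompiled shapes can ever cover the attention path; only the fixed-K projections are reliably hit.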

At short context (pp=512), attention score matmuls are small and the NPU handles most of the compute. At pp=8192, attention score matmuls dominate and CPU becomes the bottleneck.

This is a fundamental architectural constraint, not a bug. The NPU excels at fixed-shape projection matmuls; a Vulkan or CPU attention implementation handles the dynamic-K attention path.

Performance vs context length (1-col NPU, Llama 3.1 8B Q4_K_M)

| Backend             | pp=512   | pp=2048  | pp=4096  | pp=8192 |
|---------------------|----------|----------|----------|---------|
| CPU only            | 4.6 t/s  | 4.3 t/s  | 4.0 t/s  | 3.6 t/s |
| NPU 1-col (Phase 5) | 10.2 t/s | 12.9 t/s | 11.7 t/s | 8.9 t/s |

NPU throughput drops ~13% from pp=512 to pp=8192 (10.2 → 8.9 t/s), versus CPU's ~22% (4.6 → 3.6 t/s): the NPU is more context-resilient. Throughput peaks at pp=2048, where the fixed-K projection tile is fully utilised.
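The degradation figures follow directly from the table; a quick check (awk for the floating-point arithmetic):

```shell
# Percentage drop from pp=512 to pp=8192, per the table above.
awk 'BEGIN {
    printf "CPU: %.0f%%\n", (4.6 - 3.6)  / 4.6  * 100   # ~22%
    printf "NPU: %.0f%%\n", (10.2 - 8.9) / 10.2 * 100   # ~13%
}'
```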

Note: --ubatch-size 2048 was tested but did not improve NPU throughput; the cost of larger CPU-side attention batches (O(n²) score memory) outweighs any tile-utilisation gain.