8k context validated · attention fallback characterised · bench-context.sh
Ran a full sweep of context lengths (pp=512 through pp=8192) to characterise NPU throughput degradation at long context. Added tools/bench-context.sh for automated context-length sweeps.
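The sweep automation is a thin loop over prompt lengths. A minimal sketch of what `tools/bench-context.sh` does — the `pp` values are the ones from this sweep, `-m`/`-p`/`-n` are llama.cpp's `llama-bench` model/prompt-length/generation-length flags, and the actual script may differ:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of tools/bench-context.sh: sweep prompt-processing
# throughput across context lengths with llama.cpp's llama-bench.
set -u

bench_sweep() {
  local model="${1:-model.gguf}"   # model path (placeholder default)
  local pp
  for pp in 512 1024 2048 4096 8192; do
    # -p <pp>: prompt length; -n 0: skip token generation so only pp is measured.
    # Commands are echoed rather than executed so the sketch runs without a model.
    echo ./llama-bench -m "$model" -p "$pp" -n 0
  done
}

bench_sweep "$@"
```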
Validated 8k context inference on 30 GiB RAM: KV cache ≈ 1 GiB + model ≈ 4.6 GiB leaves ample headroom. No OOM observed.
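The ≈1 GiB KV-cache figure is consistent with a typical 7B-class GQA configuration. A back-of-envelope check — the layer/head counts below are illustrative assumptions, not values read from the model above:

```shell
# Assumed dims for illustration: 32 layers, 8 KV heads (GQA), head_dim 128, f16 cache.
n_layer=32; n_kv_head=8; head_dim=128; n_ctx=8192; bytes_per_elem=2
# Leading factor of 2 covers both the K and the V tensors.
kv_bytes=$(( 2 * n_layer * n_kv_head * head_dim * n_ctx * bytes_per_elem ))
echo "$(( kv_bytes / (1024 * 1024) )) MiB"   # prints "1024 MiB", i.e. 1 GiB
```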
The NPU only offloads MUL_MAT ops where the K dimension matches a loaded xclbin. Attention score matmuls have K=seq_len (the current context length) — a variable that grows with every token. These always fall back to CPU.
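The dispatch rule can be caricatured as a shape check. The function name and the specific K values below are hypothetical — the point is only that the set of offloadable K dims is fixed at xclbin load time, while attention's K grows with context:

```shell
# Toy model of the offload decision: a MUL_MAT goes to the NPU only if its
# K dim matches a loaded xclbin. K = seq_len never matches, since it changes
# as the context grows, so attention score matmuls always fall back to CPU.
can_offload_mul_mat() {
  case "$1" in
    4096|11008) return 0 ;;  # example fixed projection K dims with an xclbin loaded
    *)          return 1 ;;  # variable K (e.g. seq_len) -> CPU fallback
  esac
}
```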
At short context (pp=512), attention score matmuls are small and the NPU handles most of the compute. At pp=8192, attention score matmuls dominate and CPU becomes the bottleneck.
This is a fundamental architectural constraint, not a bug. The NPU excels at fixed-shape projection matmuls; a Vulkan or CPU attention implementation handles the dynamic-K attention path.
| Backend | pp=512 | pp=2048 | pp=4096 | pp=8192 |
|---|---|---|---|---|
| CPU only | 4.6 t/s | 4.3 t/s | 4.0 t/s | 3.6 t/s |
| NPU 1-col (Phase 5) | 10.2 t/s | 12.9 t/s | 11.7 t/s | 8.9 t/s |
NPU throughput degrades ~13% from pp=512→8192 ((10.2−8.9)/10.2), versus ~22% for CPU ((4.6−3.6)/4.6) — the NPU path is the more context-resilient of the two despite the attention fallback. Throughput peaks at pp=2048, where the fixed-K projection tile is fully utilised.
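The degradation percentages follow directly from the pp=512 and pp=8192 columns of the table:

```shell
# Relative throughput drop from pp=512 to pp=8192, per backend (values from the table).
awk 'BEGIN {
  printf "CPU: %.0f%%\n", (4.6 - 3.6) / 4.6 * 100    # prints "CPU: 22%"
  printf "NPU: %.0f%%\n", (10.2 - 8.9) / 10.2 * 100  # prints "NPU: 13%"
}'
```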
Note: --ubatch-size 2048 was tested but did not improve NPU throughput — the cost of larger CPU-side attention batches (O(n²) memory in batch size) outweighs any gain in tile utilisation.