K=4096 and K=14336 slots · tile-loop optimisation · Llama 3.1 8B
Scaled from TinyLlama (K=2048, K=5632) to Llama 3.1 8B by adding two new xclbin slots covering the remaining dominant K dimensions:
K=4096 — attention QKV + output projections and FFN gate/up layers in Llama 3.1 8B (hidden dim = 4096).
K=14336 — FFN down projection in Llama 3.1 8B (intermediate dim = 14336). The largest and most compute-intensive layer; NPU offload gives the biggest absolute speedup here.
With all four slots loaded, every fixed-K MUL_MAT in Llama 3.1 8B offloads to the NPU. Only the attention score matmuls, whose K equals the sequence length and so varies at runtime, remain on the CPU.
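Slot selection reduces to a pure function of the K dimension. A minimal sketch (the helper name and slot numbering here are illustrative, not the real layout):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical dispatch helper: map a MUL_MAT's fixed K dimension to one of
// the four loaded xclbin slots. Variable-K matmuls (attention scores, where
// K = seq_len) fall through to the CPU path.
std::optional<int> npu_slot_for_k(int64_t k) {
    switch (k) {
        case 2048:  return 0;  // TinyLlama hidden dim
        case 5632:  return 1;  // TinyLlama FFN intermediate dim
        case 4096:  return 2;  // Llama 3.1 8B hidden dim
        case 14336: return 3;  // Llama 3.1 8B FFN intermediate dim
        default:    return std::nullopt;  // variable K -> CPU fallback
    }
}
```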
Reduced per-tile dispatch overhead by pre-allocating all XRT buffer objects at slot init time rather than per-call. Reuses context-resident tile buffers (tile_a, tile_b, tile_c) across calls to avoid heap churn.
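The pre-allocation pattern looks roughly like the following sketch, with `std::vector` standing in for XRT buffer objects (the `NpuSlot` struct and tile sizes are illustrative; the `tile_a`/`tile_b`/`tile_c` names follow the changelog):

```cpp
#include <cstddef>
#include <vector>

// Context-resident tile buffers, allocated once at slot init and reused
// across every dispatch. In the real backend these would be xrt::bo objects;
// plain vectors stand in here so the sketch is self-contained.
struct NpuSlot {
    size_t tile_m, tile_n, tile_k;
    std::vector<float> tile_a, tile_b, tile_c;

    NpuSlot(size_t m, size_t n, size_t k)
        : tile_m(m), tile_n(n), tile_k(k),
          tile_a(m * k), tile_b(k * n), tile_c(m * n) {}
};

// Per-call dispatch writes into the slot's existing buffers instead of
// allocating fresh ones, so the hot loop does no heap work.
void dispatch_tile(NpuSlot &slot /*, tile coords, kernel handle, ... */) {
    // pack operands into slot.tile_a / slot.tile_b, run the kernel,
    // read the result back from slot.tile_c
    (void)slot;
}
```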
Zero-fill of partial last tiles is now conditional: full tiles skip the memset, reducing host-side work for the common case where the matrix dims are exact multiples of the tile size.
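The conditional zero-fill can be sketched as below (a hypothetical `pack_tile` helper, not the backend's actual signature): when the source region fills the whole tile, every byte of the destination is overwritten by the row copies, so the memset is pure waste and can be skipped.

```cpp
#include <cstddef>
#include <cstring>

// Copy a rows x cols region from src (row stride src_stride, in elements)
// into a tile_m x tile_k destination tile. Only partial tiles need the
// padding zeroed; full tiles are fully overwritten by the memcpy loop.
void pack_tile(float *dst, const float *src, size_t src_stride,
               size_t rows, size_t cols, size_t tile_m, size_t tile_k) {
    const bool partial = (rows < tile_m) || (cols < tile_k);
    if (partial) {
        // only partial last tiles pay for the zero-fill
        std::memset(dst, 0, tile_m * tile_k * sizeof(float));
    }
    for (size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * tile_k, src + r * src_stride,
                    cols * sizeof(float));
    }
}
```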
| Layer type | K dim | Coverage |
|---|---|---|
| Attention Q, K, V projections | K=4096 | NPU ✓ |
| Attention output projection | K=4096 | NPU ✓ |
| FFN gate projection | K=4096 | NPU ✓ |
| FFN up projection | K=4096 | NPU ✓ |
| FFN down projection | K=14336 | NPU ✓ |
| Attention score matmuls | K=seq_len (variable) | CPU (variable K) |
Prefill throughput by prompt length (pp = prompt tokens processed):

| Backend | pp=512 | pp=2048 | pp=4096 | pp=8192 |
|---|---|---|---|---|
| CPU only | 4.6 t/s | 4.3 t/s | 4.0 t/s | 3.6 t/s |
| NPU Phase 3 | ~6 t/s | ~8 t/s | ~7 t/s | ~5 t/s |
Roughly +30–85% prefill throughput vs CPU across these prompt lengths (per the table above); the gain scales with pp up to ~2048, where tile utilisation peaks, then narrows at longer prompts as the CPU-resident attention score matmuls grow with sequence length.