Phase 3 — 8B Model Support

K=4096 and K=14336 slots · tile-loop optimisation · Llama 3.1 8B

Wins
- Llama 3.1 8B fully covered (4 K-slots)
- +15–28% prefill vs CPU baseline
- Tile-loop optimised — reduced host overhead

What was built

Scaled from TinyLlama (K=2048, K=5632) to Llama 3.1 8B by adding two new xclbin slots covering the remaining dominant K dimensions:

K=4096 — attention QKV + output projections and FFN gate/up layers in Llama 3.1 8B (hidden dim = 4096).

K=14336 — FFN down projection in Llama 3.1 8B (intermediate dim = 14336). The largest and most compute-intensive layer; NPU offload gives the biggest absolute speedup here.

With all four slots loaded, every MUL_MAT with a fixed K dimension in Llama 3.1 8B offloads to the NPU. Only the attention score matmuls, whose K equals the sequence length and therefore varies per call, remain on CPU.
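The dispatch decision above reduces to a lookup from K to a slot index. A minimal sketch, assuming a four-slot layout covering both TinyLlama and Llama 3.1 8B (the function name and slot ordering are illustrative, not the project's actual API):

```cpp
#include <cstdint>

// Hypothetical helper: map a MUL_MAT's K dimension to an xclbin slot.
// Returns -1 for any K without a dedicated slot, signalling CPU fallback.
static int slot_for_k(int64_t k) {
    switch (k) {
        case 2048:  return 0;  // TinyLlama attention + FFN gate/up
        case 5632:  return 1;  // TinyLlama FFN down
        case 4096:  return 2;  // Llama 3.1 8B attention + FFN gate/up
        case 14336: return 3;  // Llama 3.1 8B FFN down
        default:    return -1; // variable K (attention scores): stay on CPU
    }
}
```

A seq_len-dependent K (say 777 mid-generation) falls through to the default case, which is exactly the CPU path described above.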

Tile-loop optimisation

Reduced per-tile dispatch overhead by pre-allocating all XRT buffer objects at slot init time rather than per-call. Reuses context-resident tile buffers (tile_a, tile_b, tile_c) across calls to avoid heap churn.
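The pre-allocation pattern can be sketched as follows; `tile_a`/`tile_b`/`tile_c` come from the text, while the struct name, sizes, and the use of `std::vector` as a stand-in for XRT buffer objects are illustrative assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: context-resident tile buffers, allocated once at slot init and
// reused on every dispatch (real code would hold xrt::bo objects instead).
struct slot_ctx {
    std::vector<uint8_t> tile_a, tile_b, tile_c;

    void init(size_t tile_bytes) {
        // One-time allocation at slot init, not per-call.
        tile_a.resize(tile_bytes);
        tile_b.resize(tile_bytes);
        tile_c.resize(tile_bytes);
    }

    // Each call reuses the same backing storage: no heap churn in the
    // per-tile dispatch loop.
    uint8_t* a() { return tile_a.data(); }
    uint8_t* b() { return tile_b.data(); }
    uint8_t* c() { return tile_c.data(); }
};
```

The design point is simply that buffer lifetime is tied to the slot, not to a single MUL_MAT call, so repeated dispatches touch no allocator.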

Zero-fill of partial last tiles is now conditional: full tiles skip the memset, reducing host-side work for common even-multiple shapes.
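The conditional zero-fill amounts to clearing only the unused tail of a partial last tile; a hedged sketch (function name and row-major layout are assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: copy rows_in_tile rows into a tile that holds tile_rows rows.
// Only a partial tile pays for the memset; full tiles skip it entirely.
void load_tile(uint8_t* tile, const uint8_t* src,
               size_t rows_in_tile, size_t tile_rows, size_t row_bytes) {
    std::memcpy(tile, src, rows_in_tile * row_bytes);
    if (rows_in_tile < tile_rows) {
        // Partial last tile: zero the tail so the kernel sees clean padding.
        std::memset(tile + rows_in_tile * row_bytes, 0,
                    (tile_rows - rows_in_tile) * row_bytes);
    }
    // Full tile (rows_in_tile == tile_rows): no memset, which is the common
    // case for shapes that are an even multiple of the tile size.
}
```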

Layer coverage — Llama 3.1 8B

| Layer type | K dim | Coverage |
| --- | --- | --- |
| Attention Q, K, V projections | K=4096 | NPU ✓ |
| Attention output projection | K=4096 | NPU ✓ |
| FFN gate projection | K=4096 | NPU ✓ |
| FFN up projection | K=4096 | NPU ✓ |
| FFN down projection | K=14336 | NPU ✓ |
| Attention score matmuls | K=seq_len (variable) | CPU (variable K) |

Performance (Llama 3.1 8B Q4_K_M, 1-col NPU)

| Backend | pp=512 | pp=2048 | pp=4096 | pp=8192 |
| --- | --- | --- | --- | --- |
| CPU only | 4.6 t/s | 4.3 t/s | 4.0 t/s | 3.6 t/s |
| NPU Phase 3 | ~6 t/s | ~8 t/s | ~7 t/s | ~5 t/s |

+15–28% prefill vs CPU; throughput scales with prompt length up to pp ≈ 2048, where tile utilisation peaks, then tapers at longer prompts.