K=4096 and K=14336 slots · tile-loop optimisation · Llama 3.1 8B
Scaled from TinyLlama (K=2048, K=5632) to Llama 3.1 8B by adding two new xclbin slots covering the remaining dominant K dimensions:
K=4096 — attention QKV + output projections and FFN gate/up layers in Llama 3.1 8B (hidden dim = 4096).
K=14336 — FFN down projection in Llama 3.1 8B (intermediate dim = 14336). The largest and most compute-intensive layer; NPU offload gives the biggest absolute speedup here.
With all four slots loaded, every fixed-K MUL_MAT in Llama 3.1 8B offloads to the NPU. Only the attention score matmuls, whose K equals the sequence length and so varies at runtime, remain on the CPU.
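Slot selection reduces to a pure function of the K dimension. A minimal sketch (the helper name and slot numbering here are illustrative, not the real layout):

```cpp
#include <cstdint>
#include <optional>

// Hypothetical dispatch helper: map a MUL_MAT's fixed K dimension to one of
// the four loaded xclbin slots. Variable-K matmuls (attention scores, where
// K = seq_len) fall through to the CPU path.
std::optional<int> npu_slot_for_k(int64_t k) {
    switch (k) {
        case 2048:  return 0;  // TinyLlama hidden dim
        case 5632:  return 1;  // TinyLlama FFN intermediate dim
        case 4096:  return 2;  // Llama 3.1 8B hidden dim
        case 14336: return 3;  // Llama 3.1 8B FFN intermediate dim
        default:    return std::nullopt;  // variable K -> CPU fallback
    }
}
```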
Reduced per-tile dispatch overhead by pre-allocating all XRT buffer objects at slot init time rather than per-call. Reuses context-resident tile buffers (tile_a, tile_b, tile_c) across calls to avoid heap churn.
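The pre-allocation pattern looks roughly like the following sketch, with `std::vector` standing in for XRT buffer objects (the `NpuSlot` struct and tile sizes are illustrative; the `tile_a`/`tile_b`/`tile_c` names follow the changelog):

```cpp
#include <cstddef>
#include <vector>

// Context-resident tile buffers, allocated once at slot init and reused
// across every dispatch. In the real backend these would be xrt::bo objects;
// plain vectors stand in here so the sketch is self-contained.
struct NpuSlot {
    size_t tile_m, tile_n, tile_k;
    std::vector<float> tile_a, tile_b, tile_c;

    NpuSlot(size_t m, size_t n, size_t k)
        : tile_m(m), tile_n(n), tile_k(k),
          tile_a(m * k), tile_b(k * n), tile_c(m * n) {}
};

// Per-call dispatch writes into the slot's existing buffers instead of
// allocating fresh ones, so the hot loop does no heap work.
void dispatch_tile(NpuSlot &slot /*, tile coords, kernel handle, ... */) {
    // pack operands into slot.tile_a / slot.tile_b, run the kernel,
    // read the result back from slot.tile_c
    (void)slot;
}
```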
Zero-fill of partial last tiles is now conditional: full tiles skip the memset, reducing host-side work for the common case where the matrix dims are exact multiples of the tile size.
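The conditional zero-fill can be sketched as below (a hypothetical `pack_tile` helper, not the backend's actual signature): when the source region fills the whole tile, every byte of the destination is overwritten by the row copies, so the memset is pure waste and can be skipped.

```cpp
#include <cstddef>
#include <cstring>

// Copy a rows x cols region from src (row stride src_stride, in elements)
// into a tile_m x tile_k destination tile. Only partial tiles need the
// padding zeroed; full tiles are fully overwritten by the memcpy loop.
void pack_tile(float *dst, const float *src, size_t src_stride,
               size_t rows, size_t cols, size_t tile_m, size_t tile_k) {
    const bool partial = (rows < tile_m) || (cols < tile_k);
    if (partial) {
        // only partial last tiles pay for the zero-fill
        std::memset(dst, 0, tile_m * tile_k * sizeof(float));
    }
    for (size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * tile_k, src + r * src_stride,
                    cols * sizeof(float));
    }
}
```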
| Layer type | K dim | Coverage |
|---|---|---|
| Attention Q, K, V projections | K=4096 | NPU ✓ |
| Attention output projection | K=4096 | NPU ✓ |
| FFN gate projection | K=4096 | NPU ✓ |
| FFN up projection | K=4096 | NPU ✓ |
| FFN down projection | K=14336 | NPU ✓ |
| Attention score matmuls | K=seq_len (variable) | CPU (variable K) |
Prefill throughput by prompt length (pp = prompt tokens processed):

| Backend | pp=512 | pp=2048 | pp=4096 | pp=8192 |
|---|---|---|---|---|
| CPU only | 4.6 t/s | 4.3 t/s | 4.0 t/s | 3.6 t/s |
| NPU Phase 3 | ~6 t/s | ~8 t/s | ~7 t/s | ~5 t/s |
Roughly +30–85% prefill throughput vs CPU across these prompt lengths (per the table above); the gain scales with pp up to ~2048, where tile utilisation peaks, then narrows at longer prompts as the CPU-resident attention score matmuls grow with sequence length.