Phase 2 — Dual-Slot Dispatch

Second xclbin slot for K=5632 FFN layers

Wins
- K=5632 FFN layers now on NPU
- Multi-slot architecture (up to 8 xclbins)
- Auto-selection by K dimension

What was built

Extended the backend to load and manage multiple xclbin slots simultaneously. Each slot covers a different K dimension; find_slot(K, N) selects the best matching slot at dispatch time.

Added slot 2: K=5632 (FFN down projection layers in TinyLlama 1.1B). Previously these fell back to CPU — now offloaded to the NPU using a separate precompiled xclbin.

The slot system is designed to scale to 8 simultaneous xclbins, each covering a different matrix shape. Slots 2 through 8 are loaded at startup from the GGML_XDNA_XCLBIN_PATH_2 through GGML_XDNA_XCLBIN_PATH_8 environment variables.
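The startup scan can be sketched as follows. Only the environment variable names come from the text; collect_slot_paths and the loop structure are illustrative, and the real loader would pass each path on to its xclbin-loading routine.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Gather xclbin paths for slots 2..8 from the environment at startup.
// Unset variables are simply skipped, so slot numbering need not be
// contiguous.
static std::vector<std::string> collect_slot_paths() {
    std::vector<std::string> paths;
    for (int i = 2; i <= 8; ++i) {
        char name[32];
        std::snprintf(name, sizeof(name), "GGML_XDNA_XCLBIN_PATH_%d", i);
        if (const char *p = std::getenv(name)) {
            paths.emplace_back(p);
        }
    }
    return paths;
}
```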

Slot selection logic

find_slot(K, N) scans loaded slots for an exact K match, then picks the slot with the largest tile_n ≤ N (best fit without overflow). Falls back to the smallest tile_n > N if no exact-fit slot exists — the tiling loop handles partial last tiles by zero-padding.
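The selection rule above can be sketched as a single scan. The xclbin_slot struct and its field names are illustrative, not the backend's actual types; the logic mirrors the description: exact K match required, largest tile_n ≤ N preferred, smallest tile_n > N as the overflow fallback.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical slot descriptor (field names are illustrative).
struct xclbin_slot {
    int64_t k;       // K dimension this xclbin was compiled for
    int64_t tile_n;  // N-tile width of the kernel
    bool    loaded;
};

// Exact K match required. Among matches, prefer the largest tile_n <= N
// (best fit without overflow); otherwise the smallest tile_n > N, whose
// partial last tile the tiling loop zero-pads. Returns nullptr if no
// slot covers K.
static const xclbin_slot *find_slot(const std::vector<xclbin_slot> &slots,
                                    int64_t K, int64_t N) {
    const xclbin_slot *best_fit = nullptr;  // largest tile_n <= N
    const xclbin_slot *overflow = nullptr;  // smallest tile_n >  N
    for (const auto &s : slots) {
        if (!s.loaded || s.k != K) continue;
        if (s.tile_n <= N) {
            if (!best_fit || s.tile_n > best_fit->tile_n) best_fit = &s;
        } else {
            if (!overflow || s.tile_n < overflow->tile_n) overflow = &s;
        }
    }
    return best_fit ? best_fit : overflow;
}
```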

If no slot matches the K dimension, supports_op() returns false and the op falls back to the next registered backend (CPU or Vulkan).
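The fallback gate reduces to a membership check over loaded slots. A minimal sketch, assuming a hypothetical xdna_slot struct and helper name; in the real backend this check would sit inside its supports_op implementation so the scheduler routes unmatched ops to the next registered backend.

```cpp
#include <cstdint>
#include <vector>

struct xdna_slot { int64_t k; bool loaded; };  // illustrative

// Report whether any loaded slot covers this K dimension. Returning
// false here makes the op fall through to the next backend (CPU or
// Vulkan) rather than failing.
static bool supports_mul_mat(const std::vector<xdna_slot> &slots, int64_t K) {
    for (const auto &s : slots) {
        if (s.loaded && s.k == K) return true;
    }
    return false;
}
```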

Layer coverage after Phase 2 (TinyLlama 1.1B)

Layer type                    K dim     Phase 1        Phase 2
Attention QKV projections     K=2048    NPU ✓          NPU ✓
Attention output projection   K=2048    NPU ✓          NPU ✓
FFN gate / up projections     K=2048    NPU ✓          NPU ✓
FFN down projection           K=5632    CPU fallback   NPU ✓