Second xclbin slot for K=5632 FFN layers
Extended the backend to load and manage multiple xclbin slots simultaneously. Each slot covers a different K dimension; find_slot(K, N) selects the best-matching slot at dispatch time.
Added slot 2: K=5632 (FFN down projection layers in TinyLlama 1.1B). Previously these layers fell back to the CPU; they are now offloaded to the NPU using a separate precompiled xclbin.
The slot system was designed to scale to 8 simultaneous xclbins, each covering a different matrix shape. Slots 2 through 8 are loaded at startup from the GGML_XDNA_XCLBIN_PATH_2 through GGML_XDNA_XCLBIN_PATH_8 environment variables.
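As a rough illustration of the startup scan, the sketch below collects the xclbin paths for slots 2 through 8 from the environment. Only the variable names come from this changelog; `collect_slot_paths` and `EnvLookup` are hypothetical names, the rule for skipping unset or empty variables is an assumption, and how slot 1 is configured is not covered here.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lookup type: maps a variable name to its value, or nullptr if unset.
// Injecting it keeps the scan testable without touching the real environment.
using EnvLookup = std::function<const char *(const char *)>;

// Hypothetical helper: returns (slot index, xclbin path) for every
// GGML_XDNA_XCLBIN_PATH_<n> variable, n = 2..8, that is set and non-empty.
std::vector<std::pair<int, std::string>> collect_slot_paths(
        const EnvLookup & lookup = [](const char * name) -> const char * {
            return std::getenv(name);
        }) {
    assert(lookup != nullptr);  // an empty std::function cannot be called

    std::vector<std::pair<int, std::string>> paths;
    for (int slot = 2; slot <= 8; ++slot) {
        const std::string var = "GGML_XDNA_XCLBIN_PATH_" + std::to_string(slot);
        const char * value = lookup(var.c_str());
        // Assumption: unset or empty variables simply leave that slot unloaded.
        if (value != nullptr && value[0] != '\0') {
            paths.emplace_back(slot, value);
        }
    }
    return paths;
}
```

Calling `collect_slot_paths()` with no argument reads the process environment through `std::getenv`; each returned path would then be handed to whatever routine actually loads the xclbin.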
find_slot(K, N) scans the loaded slots for those whose K matches exactly, then picks the one with the largest tile_n ≤ N (best fit without overflow). If no matching slot has tile_n ≤ N, it falls back to the one with the smallest tile_n > N; the tiling loop handles partial last tiles by zero-padding.
If no slot matches the K dimension, supports_op() returns false and the op falls back to the next registered backend (CPU or Vulkan).
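The selection rule and the fallback check can be sketched as follows. This is a minimal sketch, not the backend's actual code: the `Slot` struct, the explicit `slots` parameter, and the `supports_matmul` wrapper are hypothetical stand-ins, and only the K and tile_n logic is taken from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical slot descriptor; the real backend's fields are not shown here.
struct Slot {
    int64_t k;       // K dimension the xclbin was compiled for
    int64_t tile_n;  // width of one N tile handled by the xclbin
};

// Returns the selected slot, or nullptr when no loaded slot matches K.
const Slot * find_slot(const std::vector<Slot> & slots, int64_t K, int64_t N) {
    assert(N > 0);

    const Slot * best_fit      = nullptr;  // largest tile_n <= N
    const Slot * smallest_over = nullptr;  // smallest tile_n > N
    for (const Slot & s : slots) {
        if (s.k != K) {
            continue;  // K must match exactly
        }
        if (s.tile_n <= N) {
            if (best_fit == nullptr || s.tile_n > best_fit->tile_n) {
                best_fit = &s;
            }
        } else if (smallest_over == nullptr || s.tile_n < smallest_over->tile_n) {
            smallest_over = &s;
        }
    }
    // Prefer a tile that fits inside N; otherwise use the smallest oversized
    // tile and rely on the tiling loop to zero-pad the partial tile.
    return best_fit != nullptr ? best_fit : smallest_over;
}

// Stand-in for the supports_op() decision: a false result lets the op fall
// through to the next registered backend.
bool supports_matmul(const std::vector<Slot> & slots, int64_t K, int64_t N) {
    return find_slot(slots, K, N) != nullptr;
}
```

Tracking both candidates in one pass avoids sorting the slot list, which seems reasonable for a table capped at 8 entries.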
| Layer type | K dim | Phase 1 | Phase 2 |
|---|---|---|---|
| Attention QKV projections | K=2048 | NPU ✓ | NPU ✓ |
| Attention output projection | K=2048 | NPU ✓ | NPU ✓ |
| FFN gate / up projections | K=2048 | NPU ✓ | NPU ✓ |
| FFN down projection | K=5632 | CPU fallback | NPU ✓ |