Custom llama.cpp backend offloading GGML_OP_MUL_MAT to the AMD XDNA2 NPU via XRT kernel dispatch
Batch decode sweep · int4 kernel · cascade · speculative decoding · 2026-03-26
43.7 t/s is the hard performance ceiling for decode on this hardware+model combination. All four investigated approaches (batch decode, int4 quantisation, kernel cascade, speculative decoding) converge on the same bottleneck: LPDDR5 memory bandwidth serving the KV cache during single-token generation. No software technique can overcome a bandwidth-bound memory wall.
Swept batch decode from N=1 to N=64 to test whether increasing decode batch size could improve throughput. Result: completely flat at 43.7 t/s across all batch sizes.
| Batch N | Decode t/s | Change |
|---|---|---|
| 1 (baseline) | 43.7 t/s | — |
| 8 | 43.7 t/s | flat |
| 32 | 43.7 t/s | flat |
| 64 | 43.7 t/s | flat |
Diagnosis: Throughput is entirely dominated by reading the KV cache from LPDDR5 on every decode step. Increasing N multiplies KV reads proportionally; total bandwidth demand rises with N but peak bandwidth is fixed. The memory bus is already saturated at N=1 — there is no bandwidth headroom for batching to exploit.
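A toy model of the bandwidth wall reproduces the flat sweep. The numbers below (peak bandwidth, KV bytes per sequence) are illustrative placeholders, not measurements; the point is that N cancels out of the aggregate throughput entirely:

```python
# Toy model: decode throughput when per-step KV-cache reads saturate a
# fixed memory-bandwidth budget. Numbers are illustrative, not measured.
PEAK_BW = 200e9              # assumed peak LPDDR5 bandwidth, bytes/s
KV_BYTES_PER_SEQ = 2**30     # assumed KV bytes read per sequence per step

def aggregate_tps(n_batch: int) -> float:
    """Aggregate tokens/s for n_batch independent decode streams."""
    bytes_per_step = n_batch * KV_BYTES_PER_SEQ  # each stream reads its own KV
    step_time = bytes_per_step / PEAK_BW         # seconds per decode step
    return n_batch / step_time                   # n_batch tokens per step

for n in (1, 8, 32, 64):
    print(n, round(aggregate_tps(n), 1))
# The N in "tokens per step" and the N in "bytes per step" cancel:
# throughput is the same constant (PEAK_BW / KV_BYTES_PER_SEQ) for every N.
```

With N cancelling, the only ways to move the constant are raising peak bandwidth or shrinking the KV bytes read per step.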
Sanity-check tooling added: tools/int4_quant_sanity.c and tools/int4_quant_sanity.py — measure SNR of the quantisation pipeline at various bit widths without requiring a full model run.
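As a simplified stand-in for those tools (not the actual tools/int4_quant_sanity.py, which exercises the real kernel pipeline), the core measurement is symmetric quantisation at a given bit width followed by the SNR of the round-trip in dB:

```python
import math
import random

def quantise_roundtrip(xs, bits):
    """Symmetric per-tensor quantisation to `bits` bits, then dequantise."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for int4
    scale = max(abs(x) for x in xs) / qmax
    return [round(x / scale) * scale for x in xs]

def snr_db(xs, bits):
    """Signal-to-noise ratio of the quantisation round-trip, in dB."""
    ys = quantise_roundtrip(xs, bits)
    signal = sum(x * x for x in xs)
    noise = sum((x - y) ** 2 for x, y in zip(xs, ys))
    return 10 * math.log10(signal / noise)

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(4096)]
for bits in (8, 6, 4):
    print(f"int{bits}: {snr_db(weights, bits):.1f} dB")
```

Each bit of width buys roughly 6 dB of SNR on Gaussian-distributed weights, which is why a quick SNR sweep catches scaling or packing bugs without a full model run.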
Speculative decoding amortises the cost of the large target model by having a small, fast draft model propose multiple tokens at once. The target model then verifies the entire batch in a single forward pass (which is much cheaper than N separate forward passes).
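The expected gain can be quantified with the standard geometric model from the speculative-decoding literature (not project code): if each drafted token is accepted independently with probability a and the draft length is k, a single target pass emits (1 - a^(k+1)) / (1 - a) tokens on average.

```python
def expected_tokens_per_pass(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per target forward pass under i.i.d.
    acceptance: the geometric run of accepted draft tokens, plus the
    one token the target model always contributes itself."""
    a, k = accept_prob, n_draft
    return (1 - a ** (k + 1)) / (1 - a)

# A strong drafter (80% acceptance, 4-token drafts) amortises well:
print(expected_tokens_per_pass(0.8, 4))   # ≈ 3.36 tokens per pass
```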
Tokenizer compatibility issue: TinyLlama-1.1B uses SentencePiece (32k vocab), incompatible with Llama-3.1-8B's tiktoken-style BPE (128k vocab). Llama-3.2-1B-Instruct was used as the draft model instead (same llama-bpe 128k tokenizer).
n-gram lookup drafting, which proposes continuations from n-grams already seen in the context rather than running a separate draft model, was also evaluated:
| Metric | Value |
|---|---|
| n_drafted | 40 |
| n_accept | 1 (2.5%) |
| Effective t/s | 41.0 t/s |
2.5% acceptance rate — near-zero. n-gram lookup only works on repetitive/templated text (code generation, boilerplate). Generative explanatory prose has essentially no repeating n-grams. Ruled out.
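Why the acceptance rate collapses on prose is easy to demonstrate with a minimal n-gram drafter (a toy stand-in for llama.cpp's lookup decoding, operating on whitespace-split words rather than real tokens):

```python
from collections import defaultdict

def build_ngram_table(tokens, n=2):
    """Map each n-gram of the context to the tokens that followed it."""
    table = defaultdict(list)
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])].append(tokens[i + n])
    return table

def draft(tokens, table, n=2, max_draft=8):
    """Greedily extend the context by looking up the trailing n-gram."""
    out, ctx = [], list(tokens)
    for _ in range(max_draft):
        key = tuple(ctx[-n:])
        if key not in table:
            break                      # no matching n-gram: nothing to draft
        nxt = table[key][-1]           # most recent observed continuation
        out.append(nxt)
        ctx.append(nxt)
    return out

boilerplate = "for i in range ( n ) : print ( i ) for i in range ( n )".split()
prose = "the bottleneck is memory bandwidth rather than compute".split()
print(draft(boilerplate, build_ngram_table(boilerplate)))  # drafts a full 8-token run
print(draft(prose, build_ngram_table(prose)))              # drafts nothing
```

Templated code repeats its own n-grams, so the drafter proposes long runs; one-pass explanatory prose never revisits an n-gram, so the table has no usable entries.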
First run without explicit context cap: target model consumed 8 GB of VRAM for a 64k-token KV cache, leaving no VRAM for the draft model. Draft ran on CPU at ~20 t/s — slower than the 43.7 t/s NPU decode. Effective throughput fell to 10.9 t/s.
Fix: set -c 8192 to cap context and free VRAM. Both models then offloaded fully (target 33/33 layers, draft 17/17 layers to Vulkan/iGPU).
| Config | Accept % | Draft speed | Effective t/s | vs baseline |
|---|---|---|---|---|
| Draft on CPU (no ctx cap) | 59.0% | 20 t/s | 10.9 t/s | −75% |
| Draft on GPU (-c 8192) | 44.4% | 212 t/s | 41.6 t/s | −5% |
| Baseline (no speculative) | — | — | 43.7 t/s | — |
Why it still doesn't help: The verification step in speculative decoding processes n_draft + 1 tokens as a batch through the target model. This is effectively a prefill operation — it reads the KV cache once for all tokens in the batch. But that read is still bounded by LPDDR5 bandwidth: verifying 9 tokens at once costs roughly the same KV-cache I/O per step as verifying 1, just spread across more tokens, and that per-step I/O is already the binding constraint. The bandwidth wall applies to single-token decode and batched verification alike.
A 44.4% acceptance rate with a 212 t/s draft model would normally yield ~1.8× speedup. The absence of any gain confirms the bottleneck is not in compute but purely in memory bandwidth.
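The ~1.8× figure follows from the geometric acceptance model, assuming the measured 44.4% rate applies independently to each drafted token, a draft length of 8 (the 9-token verification batches noted above), and a draft fast enough at 212 t/s for its own cost to be negligible:

```python
a, k = 0.444, 8                 # measured acceptance rate; draft length assumed 8
expected = (1 - a ** (k + 1)) / (1 - a)
print(f"expected tokens per target pass: {expected:.2f}")      # 1.80
print(f"compute-bound prediction: {43.7 * expected:.1f} t/s")  # 78.5 t/s
# Observed: 41.6 t/s. The predicted gain vanishes because every target
# pass still pays the same bandwidth-bound KV-cache read from LPDDR5.
```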
The 43.7 t/s ceiling is hardware-fundamental. The LPDDR5 memory subsystem on the Ryzen AI MAX 385 delivers a fixed bandwidth budget. At decode time, every token requires reading the full KV cache (32 layers × 4 KB per layer per token ≈ 128 KB per token of context at f16, consistent with the 8 GB observed for the 64k-token cache). No matter how the compute is organised — single-token NPU dispatch, batched Vulkan verification, or speculative multi-token drafts — the bandwidth demand is structurally the same.
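Working that through (assuming the Llama-3.1-8B geometry of 32 layers with 8 GQA KV heads of dimension 128 at f16, figures that reproduce the 8 GB observed for the 64k-token cache):

```python
layers, kv_heads, head_dim = 32, 8, 128   # Llama-3.1-8B geometry (assumed)
elem_bytes = 2                            # f16 KV cache
kv_per_token = layers * kv_heads * head_dim * elem_bytes * 2   # K and V
print(kv_per_token)                       # 131072 bytes = 128 KiB per context token

print(kv_per_token * 65536 / 2**30)       # 8.0 GiB -- matches the 64k-context run

ctx = 8192                                # the -c 8192 configuration
step_bytes = kv_per_token * ctx           # KV traffic per generated token
print(step_bytes / 2**30)                 # 1.0 GiB per decode step
print(43.7 * step_bytes / 1e9)            # ~46.9 GB/s of KV reads alone at 43.7 t/s
```

The quantised model weights are streamed on top of this every step, so total per-token traffic presses against the fixed LPDDR5 budget.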
The only path to higher effective throughput at this level would be a hardware change: faster memory (LPDDR5X at higher bandwidth), a reduced-precision KV cache (fp8 or int8 KV — not yet supported in this build), or a fundamentally different model architecture with a smaller KV footprint (e.g. MLA as used in DeepSeek).