Phase 4 — Power Measurement & Workload Isolation

NPU vs Vulkan characterisation · bench-power.sh · J/tok baseline

Wins

- NPU confirmed on dedicated XDNA2 silicon
- No GPU contention — NPU runs alongside iGPU
- Power baseline established for all phases
- bench-power.sh tooling
Finding

Vulkan is 12× more energy-efficient per token (1.3 J/tok vs 16.2 J/tok) due to much higher decode throughput. NPU's advantage is workload isolation — it runs on dedicated XDNA2 silicon with no iGPU contention.

What was built

Added tools/bench-power.sh — a script that captures SoC Package Power Tracking (PPT) via ryzenadj while running llama-bench. Samples wattage at 1 Hz and reports average power, peak power, and joules-per-token.
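The script itself isn't reproduced here; the sketch below shows the shape of its two pieces, under assumptions: that `ryzenadj --info` prints a `PPT VALUE FAST` row in its pipe-separated table (the exact label varies across ryzenadj versions), and that the token count is taken from the llama-bench run.

```shell
#!/bin/sh
# Minimal sketch of the bench-power.sh approach (not the actual script).

# sample_power: one package-power reading in watts, taken once per second
# in the real tool. The "PPT VALUE FAST" field name is an assumption.
sample_power() {
    ryzenadj --info | awk -F'|' '/PPT VALUE FAST/ { gsub(/ /, "", $3); print $3 }'
}

# joules_per_token AVG_WATTS ELAPSED_SECONDS TOKENS
# Energy per token = average power x elapsed time / tokens generated.
joules_per_token() {
    awk -v w="$1" -v s="$2" -v n="$3" 'BEGIN { printf "%.2f\n", w * s / n }'
}

# Example with the Phase 4 NPU decode numbers: 58.5 W average,
# 3.6 tok/s for 100 s => 360 tokens.
joules_per_token 58.5 100 360   # prints 16.25
```

The same helper applies to any backend: only the average wattage, wall time, and token count change.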

Ran a systematic comparison between NPU (Phase 3/4-col) and Vulkan iGPU across prefill and decode workloads to characterise the efficiency trade-off.

Power measurements (Ryzen AI MAX 385)

| Backend | Prefill (t/s) | Decode (t/s) | Avg Power (W) | J/tok (decode) |
|---|---|---|---|---|
| NPU 4-col (Phase 6) | 32.7 | 3.6 | 58.5 | 16.2 |
| Vulkan iGPU | 632 | 41.6 | 52.2 | 1.3 |

NPU draws more power than Vulkan despite much lower throughput: all 4 AIE columns stay active during prefill, and the NPU idles at high power while decode runs on the CPU.
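As a sanity check on the table, the decode J/tok column is simply average package power divided by decode throughput:

```shell
# J/tok = average power (W) / decode throughput (tok/s),
# using the measured Phase 4 numbers above.
awk 'BEGIN {
    printf "NPU:    %.2f J/tok\n", 58.5 / 3.6     # prints 16.25
    printf "Vulkan: %.2f J/tok\n", 52.2 / 41.6    # prints 1.25
}'
```

Rounded to one decimal these match the 16.2 and 1.3 J/tok figures reported above.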

Workload isolation — the NPU advantage

The XDNA2 NPU runs on dedicated silicon with its own memory, power rails, and compute units. It does not share resources with the Radeon 8050S iGPU.

This means: NPU inference + iGPU gaming/rendering can run simultaneously with no contention. If the iGPU is busy, NPU is the correct backend.

Confirmed by running a GPU stress test alongside NPU inference: no throughput degradation observed on either workload.
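A check of this kind can be scripted along the following lines; the `glmark2` load generator and the 5% tolerance are illustrative choices, not necessarily what the original test used.

```shell
#!/bin/sh
# Sketch of the contention check: load the iGPU, run NPU inference,
# and flag any throughput drop beyond a tolerance.

# no_degradation SOLO_TPS CONCURRENT_TPS MAX_DROP_PCT
# Succeeds if throughput fell by less than MAX_DROP_PCT percent.
no_degradation() {
    awk -v a="$1" -v b="$2" -v p="$3" 'BEGIN { exit !((a - b) / a * 100 < p) }'
}

# Illustrative flow (tool choice is an assumption):
#   glmark2 --run-forever >/dev/null 2>&1 &   # keep the iGPU busy
#   GPU_PID=$!
#   ... run llama-bench on the NPU backend, record decode t/s ...
#   kill "$GPU_PID"

# Phase 4 observation: decode held at 3.6 t/s under iGPU load.
no_degradation 3.6 3.6 5 && echo "no contention observed"
```

The same helper works in the other direction too, comparing the iGPU workload's solo and concurrent frame rates.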

When to use each backend

Use Vulkan when the iGPU is idle and you want maximum throughput or minimum J/tok.

Use NPU when the iGPU is occupied (gaming, rendering, GPU compute) — NPU inference runs fully independently with no GPU contention.