Running LLMs on the AMD NPU with Lemonade Server
Setting up AMD's Lemonade Server on Strix Halo to run LLM and Whisper inference on the XDNA 2 NPU — driver builds, architecture decisions, and benchmarks against the integrated GPU.
My Strix Halo machine has three compute accelerators: Zen 5 CPU cores, an integrated Radeon 8060S GPU, and an AMD XDNA 2 NPU with 8 columns. Until now, I’ve only used two of them. The GPU handles LLM inference through llama.cpp and Vulkan, and the CPU picks up everything else. The NPU has been sitting idle — a dedicated AI accelerator doing nothing.
AMD’s Lemonade Server is an OpenAI-compatible inference server that runs models on the XDNA 2 NPU using the FastFlowLM runtime. The pitch is compelling: offload smaller models to the NPU so the GPU stays free for larger workloads. Two accelerators running simultaneously instead of fighting over one.
Getting there on Linux required building most of the stack from source.
The Driver Problem
The first obstacle was IOMMU. My GRUB config had amd_iommu=off, which completely prevents the kernel from seeing the NPU — /dev/accel/ never appears. Changing to amd_iommu=pt (passthrough mode) enables IOMMU for NPU detection while keeping DMA in passthrough so GPU performance isn’t impacted.
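Concretely, that's a one-line change to the kernel command line plus a config regeneration. This is a sketch of the edit for Fedora's GRUB2 layout, not verbatim from my machine — keep whatever other flags you already have, and paths may differ on other distros:

```shell
# /etc/default/grub — swap amd_iommu=off for passthrough mode
GRUB_CMDLINE_LINUX="rhgb quiet amd_iommu=pt"

# Regenerate the GRUB config, then reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```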
After reboot, the NPU PCI device (1022:17f0) was detected, but the in-tree amdxdna driver on kernel 6.19.x couldn’t bind. The firmware protocol version didn’t match — the NPU ships with firmware npu.sbin.1.1.2.65 (protocol 7.2), and the kernel module only supports up to 7.1.
The fix is building the XDNA driver stack from AMD’s source repo, which provides an updated DKMS module (v2.23.0) that speaks protocol 7.2. This means cloning amd/xdna-driver with recursive submodules, building XRT from source into four RPMs, and installing the DKMS kernel module:
sudo dnf install -y dkms
sudo /opt/xilinx/xrt/share/amdxdna/dkms_driver.sh --install
A reboot was necessary after install because the NPU was in a bad state from the old driver’s failed probe attempt. After that, verification was clean:
$ flm validate
[Linux] Kernel: 6.19.8-200.fc43.x86_64
[Linux] NPU: /dev/accel/accel0 with 8 columns
[Linux] NPU FW Version: 255.0.11.71
[Linux] amdxdna version: 1.0
[Linux] Memlock Limit: infinity
Building FastFlowLM
FastFlowLM is the NPU inference runtime that Lemonade uses under the hood. It also needs to be built from source:
cd ~/src/FastFlowLM/src
cmake --preset linux-default
cd build
cmake --build . -j$(nproc)
sudo cmake --install .
This installs to /opt/fastflowlm/bin/. One extra step: FLM depends on libxrt_coreutil.so.2, so the XRT library path needs to be registered:
echo '/opt/xilinx/xrt/lib64' | sudo tee /etc/ld.so.conf.d/xrt.conf
sudo ldconfig
Installing Lemonade Server
With the driver and runtime in place, Lemonade itself is the easy part. The SDK needs Python 3.10–3.13 (system Python 3.14 is too new), and the server is a compiled binary from an RPM:
uv venv --python 3.13 ~/lemonade-venv
source ~/lemonade-venv/bin/activate && uv pip install lemonade-sdk
sudo dnf install -y https://github.com/lemonade-sdk/lemonade/releases/download/v10.0.0/lemonade-server-10.0.0.x86_64.rpm
The server runs on port 8002, bound to all interfaces so Docker containers can reach it. I originally used 8001, but that conflicted with llama-swap’s startPort for spawning child processes.
One implementation detail worth knowing: the lemonade-server CLI binary is actually a TrayApp wrapper that spawns lemonade-router as a child process and then exits — it isn't designed for headless use. For a headless systemd service, you want to run lemonade-router directly. My service includes an ExecStartPost that waits for the router to become healthy, then explicitly loads qwen3-4b-FLM via the /api/v1/load endpoint as a belt-and-suspenders guarantee on top of auto-discovery.
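A sketch of such a unit, under stated assumptions: the binary path, the router's CLI flags, the health endpoint, and the JSON payload for /api/v1/load are all guesses to verify against your install — only the port, model name, and load endpoint come from the setup above:

```ini
# /etc/systemd/system/lemonade.service — headless router (paths and flags assumed)
[Unit]
Description=Lemonade Server (NPU inference router)
After=network.target

[Service]
ExecStart=/usr/bin/lemonade-router --host 0.0.0.0 --port 8002
# Wait for the router to come up, then pin the default NPU model.
# The health endpoint and load payload shape are assumptions — check your version.
ExecStartPost=/bin/sh -c 'for i in $(seq 1 30); do curl -sf http://127.0.0.1:8002/api/v1/health && break; sleep 1; done; curl -sf -X POST http://127.0.0.1:8002/api/v1/load -H "Content-Type: application/json" -d "{\"model_name\": \"qwen3-4b-FLM\"}"'
Restart=on-failure

[Install]
WantedBy=multi-user.target
```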
FLM models are managed with the flm CLI. I have three installed: Llama-3.2-1B, Qwen3-4B (auto-loaded on startup), and Qwen3-8B.
Architecture: Two Accelerators, One API
The goal was to make both the NPU and GPU available through a single endpoint. Here’s how it fits together:
Applications / Agents
          │
      Port 4000
          │
  LiteLLM Proxy (Docker)
  ├── npu/*   → Lemonade Server (Port 8002) → AMD NPU
  └── local/* → llama-swap (Port 8080)      → AMD iGPU
LiteLLM runs in Docker and proxies all model requests. Models prefixed npu/ route to Lemonade on port 8002; models prefixed local/ route to llama-swap on port 8080. Applications only need to know about LiteLLM on port 4000.
The LiteLLM config for an NPU model looks like:
- model_name: npu/Qwen3-4B
  litellm_params:
    model: openai/qwen3-4b-FLM
    api_base: http://host.docker.internal:8002/v1
    api_key: "dummy"
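The GPU-side entries follow the same pattern, pointed at llama-swap on port 8080 — the exact model id is whatever llama-swap exposes, so treat the `model:` value here as an assumption:

```yaml
- model_name: local/Qwen3.5-9B
  litellm_params:
    model: openai/Qwen3.5-9B          # llama-swap model id (assumed)
    api_base: http://host.docker.internal:8080/v1
    api_key: "dummy"
```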
LLM Benchmarks: NPU vs GPU
Llama-3.2-1B on NPU
Starting with the smallest model to establish a baseline:
| Metric | Value |
|---|---|
| Prefill (TTFT) | 0.65s avg |
| Decode speed | 39.5 t/s avg |
| Prefill throughput | 83.6 t/s |
39.5 tokens per second from a dedicated silicon accelerator running a 1B model. Not earth-shattering, but this is happening on hardware that draws a fraction of the GPU’s power.
Head-to-Head: ~8B Models
This is the more interesting comparison. Qwen3-8B on the NPU via FLM versus Qwen3.5-9B on the GPU via llama.cpp Vulkan. Three prompts, three runs each.
| Metric | GPU (Qwen3.5-9B) | NPU (Qwen3-8B) | Winner |
|---|---|---|---|
| TTFT | 5.12s | 2.22s | NPU 2.3x faster |
| Decode | 19.7 t/s | 8.2 t/s | GPU 2.4x faster |
| Completion tokens | 236 | 215 | — |
The NPU delivers 2.3x faster time-to-first-token — lower dispatch latency means generation starts sooner. But the GPU has 2.4x higher sustained decode throughput once generation begins. The GPU’s TTFT is also inflated by Qwen3.5’s thinking-mode overhead, so the real prefill gap is probably smaller.
The tradeoff is clear: NPU starts faster, GPU generates faster. For short responses where TTFT dominates the user experience, the NPU wins. For long generations, the GPU pulls ahead.
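You can put a rough number on that crossover using the averages above: total latency is TTFT plus tokens divided by decode speed, and setting the two totals equal gives the response length where the GPU catches up. Since the GPU's TTFT here is inflated by thinking-mode overhead, the true crossover is probably even shorter.

```python
# Total latency = TTFT + tokens / decode speed, using the benchmark averages above
npu_ttft, npu_tps = 2.22, 8.2    # NPU: Qwen3-8B via FLM
gpu_ttft, gpu_tps = 5.12, 19.7   # GPU: Qwen3.5-9B via llama.cpp Vulkan

# Solve npu_ttft + n/npu_tps == gpu_ttft + n/gpu_tps for n
crossover = (gpu_ttft - npu_ttft) / (1 / npu_tps - 1 / gpu_tps)
print(f"GPU overtakes the NPU beyond ~{crossover:.0f} tokens")  # ~41 tokens
```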
Concurrent NPU + GPU
The real question: can both accelerators run simultaneously without destroying each other’s performance? Same models, three prompts, three runs, 128 max tokens.
| Metric | Sequential | Concurrent | Delta |
|---|---|---|---|
| NPU TTFT | 2.12s | 2.43s | +14.9% |
| NPU Decode | 8.3 t/s | 7.8 t/s | -5.8% |
| GPU TTFT | 2.25s | 2.67s | +18.7% |
| GPU Decode | 18.2 t/s | 15.6 t/s | -14.1% |
Wall time: Sequential 238.9s vs Concurrent 168.8s = 1.42x speedup.
The NPU is remarkably resilient — only 5.8% decode loss under contention. Dedicated XDNA silicon has its own execution pipeline and doesn’t compete with the GPU for compute resources. The GPU takes a moderate 14.1% hit, which makes sense since Vulkan inference is memory-bandwidth-sensitive and both accelerators share the same LPDDR5X bus. TTFT increases 15–19% on both sides, pointing to memory bus saturation during prefill.
The bottom line: dual-accelerator operation works. For typical interleaved requests — which is how a real agent uses these models — the contention is acceptable.
Whisper Benchmarks: NPU vs GPU vs CPU
Lemonade also serves Whisper-Large-v3-Turbo with NPU-accelerated encoder caching. I compared it against whisper-cpp-server running distil-large-v3.5 on Vulkan GPU and CPU.
NPU vs GPU (Vulkan) — 5 warm runs
| Audio | Lemonade (NPU) | whisper.cpp (Vulkan GPU) | Winner |
|---|---|---|---|
| Short (7.6s) | 0.78s (9.8x RT) | 0.93s (8.1x RT) | NPU 20% faster |
| Long (40.9s) | 2.33s (17.6x RT) | 1.87s (21.9x RT) | GPU 20% faster |
NPU vs CPU baseline
| Audio | Lemonade (NPU) | whisper.cpp (CPU) | Speedup |
|---|---|---|---|
| Short (7.6s) | 0.73s | 19.0s | 25.9x |
| Long (40.9s) | 2.48s | 38.6s | 15.6x |
GPU and NPU are competitive on Whisper, trading leads within about 20% of each other. Short audio favors the NPU (lower dispatch overhead), long audio favors the GPU (higher raw compute throughput on larger encoder workloads). Both are dramatically faster than CPU — 15 to 26x faster.
The real advantage isn’t raw speed — it’s that Whisper can run on the NPU while the GPU handles LLM inference. No contention, no model swapping. The API is OpenAI-compatible:
curl http://127.0.0.1:8002/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=Whisper-Large-v3-Turbo" \
-F "response_format=verbose_json"
Putting It to Work: Agent Task Routing
With both accelerators available through LiteLLM, I reconfigured my Hermes agent to route auxiliary tasks based on what each accelerator is good at:
| Task | Model | Accelerator | Rationale |
|---|---|---|---|
| Summarization | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Web extraction | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Session search | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Approval classification | npu/Qwen3-4B | NPU | Classification, zero GPU contention |
| Memory management | local/Qwen3.5-4B | GPU | Requires tool/function calling |
| MCP tools | local/Qwen3.5-4B | GPU | Requires tool calling |
| Vision | local/Gemma-4-E4B-IT | GPU | Requires multimodal input |
NPU handles the text summarization and classification tasks that don’t need tool calling or multimodal capabilities. GPU handles everything that does. The main chat model stays on the GPU for the primary conversation, and the NPU chugs through background tasks without eating into the main model’s GPU time.
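In code, the table above boils down to a small lookup. A minimal sketch — the task keys are invented for illustration, while the model names and the NPU/GPU split come from my setup:

```python
# Task → model routing (task keys hypothetical; model names from this setup)
TASK_MODEL = {
    "summarization": "npu/Qwen3-4B",
    "web_extraction": "npu/Qwen3-4B",
    "session_search": "npu/Qwen3-4B",
    "approval_classification": "npu/Qwen3-4B",
    "memory_management": "local/Qwen3.5-4B",   # needs tool calling → GPU
    "mcp_tools": "local/Qwen3.5-4B",           # needs tool calling → GPU
    "vision": "local/Gemma-4-E4B-IT",          # needs multimodal input → GPU
}

def model_for(task: str, default: str = "local/Qwen3.5-4B") -> str:
    """Route known background tasks; fall back to a GPU model for anything else."""
    return TASK_MODEL.get(task, default)
```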
Practical Takeaway
The AMD NPU on Strix Halo is a genuinely useful second accelerator for local AI workloads. Getting it running on Linux meant building the driver and runtime from source just to match the NPU’s firmware protocol — it’s not a dnf install situation. But once the stack is up, Lemonade Server provides a clean OpenAI-compatible API that slots into existing infrastructure.
The performance profile is complementary rather than competitive. The NPU excels at fast startup (2.3x faster TTFT than GPU) and concurrent operation (only 5.8% decode penalty), while the GPU excels at sustained throughput (2.4x faster decode). Running both simultaneously gives a 1.42x wall-time speedup over sequential execution with acceptable contention.
For my setup, the NPU’s biggest value isn’t raw speed — it’s that background agent tasks no longer compete with the main conversation model for GPU time. Summarization, classification, and transcription happen on dedicated silicon while the GPU focuses on what it’s best at. That’s the kind of architectural win that makes the driver-building pain worthwhile.