Running LLMs on the AMD NPU with Lemonade Server
Setting up AMD's Lemonade Server on Strix Halo to run LLM and Whisper inference on the XDNA 2 NPU — driver builds, architecture decisions, and benchmarks against the integrated GPU.
My Strix Halo machine has three compute accelerators: Zen 5 CPU cores, an integrated Radeon 8060S GPU, and an AMD XDNA 2 NPU with 8 columns. Until now, I’ve only used two of them. The GPU handles LLM inference through llama.cpp and Vulkan, and the CPU picks up everything else. The NPU has been sitting idle — a dedicated AI accelerator doing nothing.
AMD’s Lemonade Server is an OpenAI-compatible inference server that runs models on the XDNA 2 NPU using the FastFlowLM runtime. The pitch is compelling: offload smaller models to the NPU so the GPU stays free for larger workloads. Two accelerators running simultaneously instead of fighting over one.
Getting there on Linux required building most of the stack from source.
The Driver Problem
The first obstacle was IOMMU. My GRUB config had amd_iommu=off, which completely prevents the kernel from seeing the NPU — /dev/accel/ never appears. Changing to amd_iommu=pt (passthrough mode) enables IOMMU for NPU detection while keeping DMA in passthrough so GPU performance isn’t impacted.
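Concretely, that's a one-line change to the kernel command line plus a config regeneration. This is a sketch of the edit for Fedora's GRUB2 layout, not verbatim from my machine — keep whatever other flags you already have, and paths may differ on other distros:

```shell
# /etc/default/grub — swap amd_iommu=off for passthrough mode
GRUB_CMDLINE_LINUX="rhgb quiet amd_iommu=pt"

# Regenerate the GRUB config, then reboot
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
```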
After reboot, the NPU PCI device (1022:17f0) was detected, but the in-tree amdxdna driver on kernel 6.19.x couldn’t bind. The firmware protocol version didn’t match — the NPU ships with firmware npu.sbin.1.1.2.65 (protocol 7.2), and the kernel module only supports up to 7.1.
The fix is building the XDNA driver stack from AMD’s source repo, which provides an updated DKMS module (v2.23.0) that speaks protocol 7.2. This means cloning amd/xdna-driver with recursive submodules, building XRT from source into four RPMs, and installing the DKMS kernel module:
sudo dnf install -y dkms
sudo /opt/xilinx/xrt/share/amdxdna/dkms_driver.sh --install
A reboot was necessary after install because the NPU was in a bad state from the old driver’s failed probe attempt. After that, verification was clean:
$ flm validate
[Linux] Kernel: 6.19.8-200.fc43.x86_64
[Linux] NPU: /dev/accel/accel0 with 8 columns
[Linux] NPU FW Version: 255.0.11.71
[Linux] amdxdna version: 1.0
[Linux] Memlock Limit: infinity
Building FastFlowLM
FastFlowLM is the NPU inference runtime that Lemonade uses under the hood. It also needs to be built from source:
cd ~/src/FastFlowLM/src
cmake --preset linux-default
cd build
cmake --build . -j$(nproc)
sudo cmake --install .
This installs to /opt/fastflowlm/bin/. One extra step: FLM depends on libxrt_coreutil.so.2, so the XRT library path needs to be registered:
echo '/opt/xilinx/xrt/lib64' | sudo tee /etc/ld.so.conf.d/xrt.conf
sudo ldconfig
Installing Lemonade Server
With the driver and runtime in place, Lemonade itself is the easy part. The SDK needs Python 3.10–3.13 (system Python 3.14 is too new), and the server is a compiled binary from an RPM:
uv venv --python 3.13 ~/lemonade-venv
source ~/lemonade-venv/bin/activate && uv pip install lemonade-sdk
sudo dnf install -y https://github.com/lemonade-sdk/lemonade/releases/download/v10.0.0/lemonade-server-10.0.0.x86_64.rpm
The server runs on port 8002, bound to all interfaces so Docker containers can reach it. I originally used 8001, but that conflicted with llama-swap’s startPort for spawning child processes.
One implementation detail worth knowing: the lemonade-server CLI binary is actually a TrayApp wrapper that spawns lemonade-router as a child process and then exits — it isn't designed for headless use. For a headless systemd service, you want to run lemonade-router directly. My service includes an ExecStartPost that waits for the router to become healthy, then explicitly loads qwen3-4b-FLM via the /api/v1/load endpoint as a belt-and-suspenders guarantee on top of auto-discovery.
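A sketch of such a unit, under stated assumptions: the binary path, the router's CLI flags, the health endpoint, and the JSON payload for /api/v1/load are all guesses to verify against your install — only the port, model name, and load endpoint come from the setup above:

```ini
# /etc/systemd/system/lemonade.service — headless router (paths and flags assumed)
[Unit]
Description=Lemonade Server (NPU inference router)
After=network.target

[Service]
ExecStart=/usr/bin/lemonade-router --host 0.0.0.0 --port 8002
# Wait for the router to come up, then pin the default NPU model.
# The health endpoint and load payload shape are assumptions — check your version.
ExecStartPost=/bin/sh -c 'for i in $(seq 1 30); do curl -sf http://127.0.0.1:8002/api/v1/health && break; sleep 1; done; curl -sf -X POST http://127.0.0.1:8002/api/v1/load -H "Content-Type: application/json" -d "{\"model_name\": \"qwen3-4b-FLM\"}"'
Restart=on-failure

[Install]
WantedBy=multi-user.target
```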
FLM models are managed with the flm CLI. I have three installed: Llama-3.2-1B, Qwen3-4B (auto-loaded on startup), and Qwen3-8B.
Architecture: Two Accelerators, One API
The goal was to make both the NPU and GPU available through a single endpoint. Here’s how it fits together:
Applications / Agents
          │
      Port 4000
          │
  LiteLLM Proxy (Docker)
  ├── npu/*   → Lemonade Server (Port 8002) → AMD NPU
  └── local/* → llama-swap (Port 8080)      → AMD iGPU
LiteLLM runs in Docker and proxies all model requests. Models prefixed npu/ route to Lemonade on port 8002; models prefixed local/ route to llama-swap on port 8080. Applications only need to know about LiteLLM on port 4000.
The LiteLLM config for an NPU model looks like:
- model_name: npu/Qwen3-4B
  litellm_params:
    model: openai/qwen3-4b-FLM
    api_base: http://host.docker.internal:8002/v1
    api_key: "dummy"
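The GPU-side entries follow the same pattern, pointed at llama-swap on port 8080 — the exact model id is whatever llama-swap exposes, so treat the `model:` value here as an assumption:

```yaml
- model_name: local/Qwen3.5-9B
  litellm_params:
    model: openai/Qwen3.5-9B          # llama-swap model id (assumed)
    api_base: http://host.docker.internal:8080/v1
    api_key: "dummy"
```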
LLM Benchmarks: NPU vs GPU
Llama-3.2-1B on NPU
Starting with the smallest model to establish a baseline:
| Metric | Value |
|---|---|
| Prefill (TTFT) | 0.65s avg |
| Decode speed | 39.5 t/s avg |
| Prefill throughput | 83.6 t/s |
39.5 tokens per second from a dedicated silicon accelerator running a 1B model. Not earth-shattering, but this is happening on hardware that draws a fraction of the GPU’s power.
Head-to-Head: ~8B Models
This is the more interesting comparison. Qwen3-8B on the NPU via FLM versus Qwen3.5-9B on the GPU via llama.cpp Vulkan. Three prompts, three runs each.
| Metric | GPU (Qwen3.5-9B) | NPU (Qwen3-8B) | Winner |
|---|---|---|---|
| TTFT | 5.12s | 2.22s | NPU 2.3x faster |
| Decode | 19.7 t/s | 8.2 t/s | GPU 2.4x faster |
| Completion tokens | 236 | 215 | — |
The NPU delivers 2.3x faster time-to-first-token — lower dispatch latency means generation starts sooner. But the GPU has 2.4x higher sustained decode throughput once generation begins. The GPU’s TTFT is also inflated by Qwen3.5’s thinking-mode overhead, so the real prefill gap is probably smaller.
The tradeoff is clear: NPU starts faster, GPU generates faster. For short responses where TTFT dominates the user experience, the NPU wins. For long generations, the GPU pulls ahead.
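You can put a rough number on that crossover using the averages above: total latency is TTFT plus tokens divided by decode speed, and setting the two totals equal gives the response length where the GPU catches up. Since the GPU's TTFT here is inflated by thinking-mode overhead, the true crossover is probably even shorter.

```python
# Total latency = TTFT + tokens / decode speed, using the benchmark averages above
npu_ttft, npu_tps = 2.22, 8.2    # NPU: Qwen3-8B via FLM
gpu_ttft, gpu_tps = 5.12, 19.7   # GPU: Qwen3.5-9B via llama.cpp Vulkan

# Solve npu_ttft + n/npu_tps == gpu_ttft + n/gpu_tps for n
crossover = (gpu_ttft - npu_ttft) / (1 / npu_tps - 1 / gpu_tps)
print(f"GPU overtakes the NPU beyond ~{crossover:.0f} tokens")  # ~41 tokens
```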
Concurrent NPU + GPU
The real question: can both accelerators run simultaneously without destroying each other’s performance? Same models, three prompts, three runs, 128 max tokens.
| Metric | Sequential | Concurrent | Delta |
|---|---|---|---|
| NPU TTFT | 2.12s | 2.43s | +14.9% |
| NPU Decode | 8.3 t/s | 7.8 t/s | -5.8% |
| GPU TTFT | 2.25s | 2.67s | +18.7% |
| GPU Decode | 18.2 t/s | 15.6 t/s | -14.1% |
Wall time: Sequential 238.9s vs Concurrent 168.8s = 1.42x speedup.
The NPU is remarkably resilient — only 5.8% decode loss under contention. Dedicated XDNA silicon has its own execution pipeline and doesn’t compete with the GPU for compute resources. The GPU takes a moderate 14.1% hit, which makes sense since Vulkan inference is memory-bandwidth-sensitive and both accelerators share the same LPDDR5X bus. TTFT increases 15–19% on both sides, pointing to memory bus saturation during prefill.
The bottom line: dual-accelerator operation works. For typical interleaved requests — which is how a real agent uses these models — the contention is acceptable.
Whisper Benchmarks: NPU vs GPU vs CPU
Lemonade also serves Whisper-Large-v3-Turbo with NPU-accelerated encoder caching. I compared it against whisper-cpp-server running distil-large-v3.5 on Vulkan GPU and CPU.
NPU vs GPU (Vulkan) — 5 warm runs
| Audio | Lemonade (NPU) | whisper.cpp (Vulkan GPU) | Winner |
|---|---|---|---|
| Short (7.6s) | 0.78s (9.8x RT) | 0.93s (8.1x RT) | NPU 20% faster |
| Long (40.9s) | 2.33s (17.6x RT) | 1.87s (21.9x RT) | GPU 20% faster |
NPU vs CPU baseline
| Audio | Lemonade (NPU) | whisper.cpp (CPU) | Speedup |
|---|---|---|---|
| Short (7.6s) | 0.73s | 19.0s | 25.9x |
| Long (40.9s) | 2.48s | 38.6s | 15.6x |
GPU and NPU are competitive on Whisper, trading leads within about 20% of each other. Short audio favors the NPU (lower dispatch overhead), long audio favors the GPU (higher raw compute throughput on larger encoder workloads). Both are dramatically faster than CPU — 15 to 26x faster.
The real advantage isn’t raw speed — it’s that Whisper can run on the NPU while the GPU handles LLM inference. No contention, no model swapping. The API is OpenAI-compatible:
curl http://127.0.0.1:8002/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=Whisper-Large-v3-Turbo" \
-F "response_format=verbose_json"
Putting It to Work: Agent Task Routing
With both accelerators available through LiteLLM, I reconfigured my Hermes agent to route auxiliary tasks based on what each accelerator is good at:
| Task | Model | Accelerator | Rationale |
|---|---|---|---|
| Summarization | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Web extraction | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Session search | npu/Qwen3-4B | NPU | Text-only, zero GPU contention |
| Approval classification | npu/Qwen3-4B | NPU | Classification, zero GPU contention |
| Memory management | local/Qwen3.5-4B | GPU | Requires tool/function calling |
| MCP tools | local/Qwen3.5-4B | GPU | Requires tool calling |
| Vision | local/Gemma-4-E4B-IT | GPU | Requires multimodal input |
NPU handles the text summarization and classification tasks that don’t need tool calling or multimodal capabilities. GPU handles everything that does. The main chat model stays on the GPU for the primary conversation, and the NPU chugs through background tasks without eating into the main model’s GPU time.
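In code, the table above boils down to a small lookup. A minimal sketch — the task keys are invented for illustration, while the model names and the NPU/GPU split come from my setup:

```python
# Task → model routing (task keys hypothetical; model names from this setup)
TASK_MODEL = {
    "summarization": "npu/Qwen3-4B",
    "web_extraction": "npu/Qwen3-4B",
    "session_search": "npu/Qwen3-4B",
    "approval_classification": "npu/Qwen3-4B",
    "memory_management": "local/Qwen3.5-4B",   # needs tool calling → GPU
    "mcp_tools": "local/Qwen3.5-4B",           # needs tool calling → GPU
    "vision": "local/Gemma-4-E4B-IT",          # needs multimodal input → GPU
}

def model_for(task: str, default: str = "local/Qwen3.5-4B") -> str:
    """Route known background tasks; fall back to a GPU model for anything else."""
    return TASK_MODEL.get(task, default)
```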
Practical Takeaway
The AMD NPU on Strix Halo is a genuinely useful second accelerator for local AI workloads. Getting it running on Linux meant building the driver and runtime from source just to match the NPU’s firmware protocol — it’s not a dnf install situation. But once the stack is up, Lemonade Server provides a clean OpenAI-compatible API that slots into existing infrastructure.
The performance profile is complementary rather than competitive. The NPU excels at fast startup (2.3x faster TTFT than GPU) and concurrent operation (only 5.8% decode penalty), while the GPU excels at sustained throughput (2.4x faster decode). Running both simultaneously gives a 1.42x wall-time speedup over sequential execution with acceptable contention.
For my setup, the NPU’s biggest value isn’t raw speed — it’s that background agent tasks no longer compete with the main conversation model for GPU time. Summarization, classification, and transcription happen on dedicated silicon while the GPU focuses on what it’s best at. That’s the kind of architectural win that makes the driver-building pain worthwhile.