My Strix Halo machine runs over 30 local models across two accelerators — an integrated Radeon 8060S GPU and an AMD XDNA 2 NPU. Each accelerator has its own inference server, its own API format, its own port, its own model naming scheme. Every application I build needs to know which server hosts which model and how to talk to it. That doesn’t scale.

The solution is three layers of software that turn this mess into a single OpenAI-compatible endpoint. LiteLLM proxies all requests through port 4000. llama-swap manages GPU models behind port 8080. Lemonade Server handles NPU models on port 8002. Applications only need to know about LiteLLM.

Architecture

┌──────────────────────────────────────────────────────────────┐
│  Remote Device (laptop, phone, cloud VM on tailnet)          │
│  litellm.example.com → DNS → 100.x.x.x (Tailscale IP)        │
└──────────────────────┬───────────────────────────────────────┘

              Tailscale (WireGuard-encrypted)

┌──────────────────────▼───────────────────────────────────────┐
│  Home: sigma                                                 │
│                                                              │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Caddy (:443, tls internal)                              │ │
│  │ litellm.* → :4000  swap.* → :8080  lemon.* → :8002      │ │
│  └──────────────────────┬──────────────────────────────────┘ │
│                         │                                    │
│  ┌──────────────────────▼──────────────────────────────────┐ │
│  │  Applications / Agents                                  │ │
│  │  (Hermes, Oneiros, Medium-Claw, etc.)                   │ │
│  │  All connect to localhost:4000                          │ │
│  └──────────────────────┬──────────────────────────────────┘ │
│                         │                                    │
│                    Port 4000                                 │
│                         │                                    │
│  ┌──────────────────────▼──────────────────────────────────┐ │
│  │  LiteLLM Proxy (Docker)                                 │ │
│  │  Unified model API                                      │ │
│  │  npu/* → :8002      local/* → :8080                     │ │
│  └───┬──────────────────────────────┬──────────────────────┘ │
│      │                              │                        │
│      │ host.docker.internal         │                        │
│   Port 8002                    Port 8080                     │
│      │                              │                        │
│  ┌───▼──────────────┐   ┌──────────▼─────────────────────┐   │
│  │ Lemonade Server  │   │ llama-swap                     │   │
│  │ v10.0.0          │   │                                │   │
│  │ ┌──────────────┐ │   │  ┌────────────────────────┐    │   │
│  │ │ FLM Backend  │ │   │  │ llama-server           │    │   │
│  │ │ (NPU)        │ │   │  │ (Vulkan GPU)           │    │   │
│  │ └──────┬───────┘ │   │  └──────┬─────────────────┘    │   │
│  │        │         │   │         │                      │   │
│  │        │         │   │  toolbox: llama-vulkan-radv    │   │
│  └────────┬─────────┘   └─────────┬──────────────────────┘   │
│           │                       │                          │
│    /dev/accel/accel0       /dev/dri/renderD128               │
│           │                       │                          │
│    ┌──────▼──────┐         ┌──────▼──────┐                   │
│    │  AMD NPU    │         │  AMD iGPU   │                   │
│    │  XDNA 2     │         │  RDNA 3.5   │                   │
│    │  8 columns  │         │  Vulkan     │                   │
│    └─────────────┘         └─────────────┘                   │
└──────────────────────────────────────────────────────────────┘

Remote devices reach Caddy over Tailscale; local applications connect directly to port 4000. Either way, the request hits LiteLLM, which routes by model prefix. Models prefixed npu/ go to Lemonade. Models prefixed local/ go to llama-swap. An application requesting local/Qwen3.5-35B-A3B has no idea it’s hitting a llama.cpp server running inside a Fedora toolbox container on the integrated GPU. It just sees an OpenAI-compatible API.
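The prefix convention is simple enough to show in miniature. A hypothetical sketch of the routing decision — the backend map and `route` function are illustrative, not LiteLLM's actual code, which is driven by its `model_list` config:

```python
# Illustrative sketch of the prefix routing LiteLLM performs.
# In reality this mapping lives in the LiteLLM YAML config, not code.
BACKENDS = {
    "npu/": "http://localhost:8002/v1",    # Lemonade Server (NPU)
    "local/": "http://localhost:8080/v1",  # llama-swap (GPU)
}

def route(model: str) -> str:
    """Return the backend base URL for a prefixed model name."""
    for prefix, base_url in BACKENDS.items():
        if model.startswith(prefix):
            return base_url
    raise ValueError(f"no backend for model {model!r}")
```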

llama-swap and Toolbox Containers

llama-swap is the GPU-side model orchestrator. It exposes port 8080 with an OpenAI-compatible API and manages the lifecycle of llama-server processes — loading, unloading, and hot-swapping models on demand. What’s unusual about this setup is that llama-server doesn’t run directly on the host.

Why Toolbox Containers

The Strix Halo iGPU (GFX1151, RDNA 3.5) needs a carefully built llama.cpp binary — Vulkan RADV with cooperative matrix support, the right Mesa version, and Strix Halo-specific workarounds. Rather than maintaining a custom build on the host, llama-swap spawns its processes inside a Fedora toolbox container built specifically for this hardware.

The image is kyuz0/amd-strix-halo-toolboxes:vulkan-radv. It tracks llama.cpp master with each rebuild, includes the correct Mesa RADV stack for GFX1151, and carries Strix Halo-specific documentation about required flags and known issues. Toolbox containers share the host’s filesystem, network, and devices — so the llama-server process inside the container sees /dev/dri/renderD128 (the iGPU) and can listen on the host’s network as if it were running natively.

When llama.cpp merges an important fix (like the Gemma 4 tokenizer fix that silently dropped CJK characters), updating is a single command:

~/refresh-toolboxes.sh llama-vulkan-radv

This pulls the latest image, destroys the old container, and recreates it. Then restart llama-swap and the new build is live. No build-from-source cycle, no dependency management on the host.

Strix Halo-Specific Tuning

Because the integrated GPU shares 128 GB of unified LPDDR5X with the CPU, this hardware comes with a specific set of performance constraints. Every model definition in the llama-swap config includes flags tuned for it:

Flag                                       Purpose
--no-mmap                                  Avoids memory-mapped I/O overhead on unified memory
-fa on                                     Enables flash attention
-ngl 999                                   Offloads all layers to GPU
-b 512 -ub 256                             Mitigates the Vulkan ubatch cliff on Strix Halo
--cache-type-k q8_0 --cache-type-v q8_0    Halves KV cache memory with negligible quality impact

The GPU clock also needs to be forced to 2900 MHz — the default auto power management profile doesn’t reliably boost under Vulkan compute workloads, leaving the iGPU stuck at 600 MHz. A systemd service (amdgpu-perf-high.service) sets power_dpm_force_performance_level to high on boot. Without this fix, a 35B MoE model that should run at 33 t/s crawls at 27 t/s. With all optimizations applied, the same model went from 73.6 t/s prompt processing to 140.7 t/s, and available RAM went from 9.5 GB to 70 GB thanks to KV cache quantization and the --no-mmap switch.
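The unit file itself is only a few lines. A minimal sketch of what amdgpu-perf-high.service could look like — the /sys/class/drm/card0 path is an assumption, since the card index varies by system:

```ini
# Sketch of amdgpu-perf-high.service.
# NOTE: card0 is assumed; check which /sys/class/drm/cardN is the iGPU.
[Unit]
Description=Force amdgpu performance level to high

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level'

[Install]
WantedBy=multi-user.target
```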

Config Structure

llama-swap uses YAML macros to keep the config DRY across 30+ model definitions:

macros:
  llama_server_base: "toolbox run -c llama-vulkan-radv llama-server"
  common_gpu_flags: "-fa on -ngl 999 -b 512 -ub 256"
  common_nommap: "--no-mmap"
  models_dir: "/home/zetaphor/Secondary/Models"

models:
  "qwen3.5-35b-a3b":
    cmd: >
      ${llama_server_base} ${common_gpu_flags} ${common_nommap}
      -c 262144
      --cache-type-k q8_0 --cache-type-v q8_0
      --jinja --temp 0.6 --min-p 0.0
      -m ${models_dir}/Qwen3.5-35B-A3B-UD-Q6_K_XL.gguf
    aliases:
      - "openai/qwen3.5-35b-a3b"
      - "local/Qwen3.5-35B-A3B"

The llama_server_base macro is the key detail — every model runs through toolbox run -c llama-vulkan-radv llama-server, which executes inside the container. The aliases list is how LiteLLM finds the model: its config references openai/qwen3.5-35b-a3b, which matches the alias, and llama-swap routes the request to the right process.

Models are grouped for management and can be preloaded on startup to avoid cold-start latency:

hooks:
  on_startup:
    preload:
      - "qwen3.5-35b-a3b"

groups:
  "all-models":
    members:
      - "qwen3.5-35b-a3b"
      - "qwen3.5-9b"
      - "qwen3.5-27b"
      # ... 25+ more models

llama-swap runs as a systemd user service. Child llama-server processes use startPort: 8001 internally, which is why Lemonade Server moved from port 8001 to 8002 — avoiding that conflict.

LiteLLM: The Unified Proxy

LiteLLM runs in Docker and serves port 4000. It’s the single entry point for every application. The config maps user-facing model names to backend endpoints:

model_list:
  # GPU models via llama-swap
  - model_name: local/Qwen3.5-35B-A3B
    litellm_params:
      model: openai/qwen3.5-35b-a3b
      api_base: http://host.docker.internal:8080/v1
      api_key: "dummy"
    model_info:
      input_cost_per_token: 0.00000200
      output_cost_per_token: 0.00000800

  # NPU models via Lemonade
  - model_name: npu/Qwen3-4B
    litellm_params:
      model: openai/qwen3-4b-FLM
      api_base: http://host.docker.internal:8002/v1
      api_key: "dummy"
    model_info:
      input_cost_per_token: 0.00000005
      output_cost_per_token: 0.00000020

Since LiteLLM runs in Docker, it reaches the host-side backends via host.docker.internal. The api_key: "dummy" satisfies LiteLLM’s requirement for a key even though neither backend needs authentication.
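One detail worth noting: on Linux Docker Engine, host.docker.internal is not defined automatically the way it is on Docker Desktop — the compose file has to map it to the host gateway. A sketch of the relevant fragment (the service name and image tag are assumptions about the actual compose file):

```yaml
# Sketch of the compose fragment that makes host.docker.internal
# resolve on Linux; service name and image are assumptions.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    extra_hosts:
      - "host.docker.internal:host-gateway"
    ports:
      - "4000:4000"
```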

One operational quirk worth knowing: LiteLLM’s config has store_model_in_db: true, which means a plain docker compose restart does not pick up new model entries from the YAML file. Adding a model requires a full teardown:

cd ~/docker/litellm && docker compose down litellm && docker compose up -d litellm

The litellm_params.model value must exactly match an alias in the llama-swap config. If llama-swap defines the alias openai/qwen3.5-35b-a3b, then LiteLLM must use model: openai/qwen3.5-35b-a3b. A mismatch produces “no healthy deployments” errors that are confusing to debug.
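A quick consistency check catches this before it becomes a runtime error. A sketch using plain dicts standing in for the two parsed configs — in practice these would be loaded from the YAML files, and the function name is mine:

```python
# Sketch of an alias consistency check between the two configs.
# The dicts below stand in for the parsed YAML files.
llama_swap_aliases = {
    "openai/qwen3.5-35b-a3b",
    "local/Qwen3.5-35B-A3B",
}

litellm_models = [
    {"model_name": "local/Qwen3.5-35B-A3B", "model": "openai/qwen3.5-35b-a3b"},
]

def missing_aliases(models, aliases):
    """Return LiteLLM entries whose backend model has no llama-swap alias."""
    return [m["model_name"] for m in models if m["model"] not in aliases]
```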

Lemonade Server: The NPU Backend

Lemonade Server is the third piece — AMD’s OpenAI-compatible inference server that runs models on the XDNA 2 NPU using the FastFlowLM runtime. I covered the full setup, driver builds, and benchmarks in Running LLMs on the AMD NPU with Lemonade Server. Here’s how it fits into the infrastructure.

Lemonade runs as a systemd user service on port 8002, bound to all interfaces so Docker containers can reach it. It serves FLM-compiled models — currently Llama-3.2-1B, Qwen3-4B (auto-loaded on startup), and Qwen3-8B. The API is OpenAI-compatible, so from LiteLLM’s perspective it looks identical to llama-swap — just a different api_base.

The NPU’s value in this architecture isn’t raw speed. As documented in the Lemonade post, the GPU has 2.4x higher sustained decode throughput on comparable models. The value is concurrent operation without contention. The NPU runs on dedicated XDNA silicon with its own execution pipeline. Running Qwen3-4B on the NPU while Qwen3.5-35B-A3B runs on the GPU costs only a 5.8% decode penalty on the NPU side — essentially free parallelism.

This is why my Hermes agent routes background tasks to npu/Qwen3-4B:

Task                     Model                 Accelerator
Summarization            npu/Qwen3-4B          NPU
Web extraction           npu/Qwen3-4B          NPU
Session search           npu/Qwen3-4B          NPU
Approval classification  npu/Qwen3-4B          NPU
Memory management        local/Qwen3.5-4B      GPU
MCP tools                local/Qwen3.5-4B      GPU
Vision                   local/Gemma-4-E4B-IT  GPU

Text-only tasks that don’t need tool calling go to the NPU. Tasks requiring function calling or multimodal input stay on the GPU. The main chat model runs on the GPU for the primary conversation, and background NPU tasks don’t affect its performance.
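In agent code this routing ends up as little more than a lookup table. An illustrative sketch of how Hermes-style task routing might look — the task keys mirror a subset of the table above, and the function name is hypothetical:

```python
# Illustrative sketch of per-task model routing; task names mirror the
# routing table, the function and default are hypothetical.
TASK_MODELS = {
    "summarization": "npu/Qwen3-4B",
    "web_extraction": "npu/Qwen3-4B",
    "memory_management": "local/Qwen3.5-4B",
    "vision": "local/Gemma-4-E4B-IT",
}

def model_for(task: str, default: str = "local/Qwen3.5-35B-A3B") -> str:
    """Pick the model for a background task; fall back to the main chat model."""
    return TASK_MODELS.get(task, default)
```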

Remote Access: Tailscale-Only Subdomains

Everything described so far runs on a single machine at home. That’s fine when I’m sitting at the desk, but I also want to hit this infrastructure from my laptop, my phone, or any other device on my network — using clean URLs instead of remembering port numbers.

The key constraint is that these services should not be on the public internet. I don’t want anyone outside my network reaching LiteLLM, llama-swap, or any of the other homelab services. The solution is Tailscale combined with a local reverse proxy.

How Domain Resolution Works

A wildcard DNS record (*.example.com) points to the machine’s Tailscale IP — a 100.x.x.x address in the CGNAT range. Only devices enrolled in my Tailscale network can route to that IP. Anyone else who resolves the domain gets an address they simply can’t reach, and the connection silently times out.
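Tailscale assigns addresses from the CGNAT block 100.64.0.0/10 (RFC 6598), which is easy to verify with the standard library — the sample address in the test is illustrative:

```python
# Tailscale IPs come from the CGNAT block 100.64.0.0/10 (RFC 6598).
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")

def is_tailscale_style(addr: str) -> bool:
    """True if the address falls inside the CGNAT range Tailscale uses."""
    return ipaddress.ip_address(addr) in CGNAT
```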

Tailnet device
      │  DNS: litellm.example.com resolves to 100.x.x.x
      ▼
Tailscale routes to sigma
      │
      ▼
Caddy (:443, tls internal) → localhost:4000
      │
      ▼
LiteLLM responds

No VPN client configuration, no authentication prompts — the network layer itself is the access control. The wildcard means every new subdomain works automatically without touching DNS.

The Local Reverse Proxy

DNS resolves the subdomain to an IP, but it can’t encode port numbers. litellm.example.com reaches port 443 by default, not port 4000. A reverse proxy on the machine handles the subdomain-to-port mapping.

Caddy runs in Docker with network_mode: host so it can reach all local services directly. The Tailscale IP is in the CGNAT range, so it can’t get a Let’s Encrypt certificate — ACME challenge servers can’t reach it. Instead, Caddy uses tls internal to generate certificates from its own local CA. A shared Caddyfile snippet keeps this DRY across all subdomains:

{
    auto_https disable_redirects
}

(tailscale_tls) {
    tls internal
}

litellm.example.com {
    import tailscale_tls
    reverse_proxy localhost:4000
}

swap.example.com {
    import tailscale_tls
    reverse_proxy localhost:8080
}

The auto_https disable_redirects global option prevents Caddy from adding automatic HTTP-to-HTTPS redirects — tailnet clients connect directly over HTTPS. Each subdomain block imports the tailscale_tls snippet, which tells Caddy to use its internal CA for certificate generation. Caddy generates and renews these certificates automatically with no external dependencies.

Trusting the Internal CA

Since Caddy’s internal CA isn’t publicly trusted, each tailnet device needs the root CA certificate installed once. The cert lives in the Caddy data volume and can be extracted with:

docker exec caddy-tailscale cat /data/caddy/pki/authorities/local/root.crt > caddy-root-ca.crt

Install it into the system trust store (e.g. /etc/pki/ca-trust/source/anchors/ on Fedora, /usr/local/share/ca-certificates/ on Debian) and update the trust database. After that, browsers and CLI tools trust *.example.com over HTTPS with no warnings. If the Caddy data volume is ever recreated, a new CA is generated and the cert must be redistributed.

Firewall Isolation

On Fedora, firewalld controls which ports are reachable on which interfaces. A dedicated tailscale zone is bound to the tailscale0 interface with only http and https services allowed. The default FedoraWorkstation zone on the LAN interface does not include http or https, so ports 80 and 443 are only reachable over the tailnet — not from the local network or the internet.

What Gets Exposed

The LLM infrastructure services each have their own subdomain, but the same pattern extends to every other homelab service — search engines, dashboards, databases, remote desktop. Adding a new one is a Caddyfile block with the tailscale_tls import and a reload.

The LLM-relevant subdomains:

Subdomain   Home Service     Port
litellm.*   LiteLLM Proxy    4000
swap.*      llama-swap       8080
lemon.*     Lemonade Server  8002

From any device on my tailnet, I can point an OpenAI-compatible client at https://litellm.example.com/v1 and get the full model catalog — GPU models, NPU models, embeddings, all of it. From my phone, my laptop at a coffee shop, a cloud VM I’ve enrolled in the tailnet — anywhere Tailscale is connected.

Pricing for Observability

With the routing and access layers in place, the next problem is visibility into what’s actually happening. Local models have zero direct API cost, but that makes LiteLLM’s dashboards and spend logs useless — every request shows $0.00, every usage report is flat, and you lose the ability to compare model costs or track which models get the most use relative to their “expense.”

The solution is estimated pricing. Each model gets model_info entries for input_cost_per_token and output_cost_per_token, calibrated against real API provider rates and scaled by model size. This isn’t billing — it’s observability. The numbers let me see at a glance that my agent burned through the equivalent of $2.40 on reasoning tasks yesterday, mostly on the 35B model, and whether that usage pattern makes sense.

Pricing Tiers

Reference rates are derived from OpenAI and Anthropic pricing, then assigned to local models based on active parameter count and capability:

Tier        Input (per 1M)  Output (per 1M)  Models
Budget      $0.10–0.30      $0.40–1.20       Qwen3.5-0.8B, Qwen3.5-2B, Qwen3.5-4B, TinyAgent-1.1B, GLM-4.7-Flash
Mid-range   $0.25–4.00      $1.25–20.00      Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3-Coder, Cydonia-24B
Premium     $5.00           $25.00           Qwen3-235B-A22B, MiniMax-M2.5-REAP-172B
Embeddings  $0.08–0.10      $0.00            snowflake-arctic-v2, nomic-embed-text, qwen3-embedding-8b
NPU         $0.05           $0.20            npu/Qwen3-4B, npu/Qwen3-8B

The output:input ratio stays around 4–5x across the board, matching the pattern used by commercial API providers. NPU models are priced deliberately lower than their GPU equivalents — the NPU draws a fraction of the power and runs on dedicated silicon, so the “cost” of running a 4B model on the NPU is genuinely less than running it on the GPU even though both are local.

The per-token format LiteLLM expects is the per-1M-token price divided by 1,000,000. So $2.00/1M input becomes 0.00000200 in the config.
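The conversion is mechanical enough to script when generating config entries. A sketch — the function name is mine, not part of LiteLLM:

```python
# Sketch of the per-1M-to-per-token conversion used in the LiteLLM config.
def per_token(price_per_million: float) -> float:
    """Convert a $/1M-token rate to the per-token value LiteLLM expects."""
    return price_per_million / 1_000_000
```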

Adding a New Model

The end-to-end workflow for adding a model demonstrates how the pieces connect:

1. Download the GGUF:

hf download unsloth/Qwen3.5-9B-GGUF Qwen3.5-9B-UD-Q8_K_XL.gguf \
  --local-dir ~/Secondary/Models

2. Add to llama-swap (~/llama-swap/llama-swap.config.yaml):

  "qwen3.5-9b":
    cmd: >
      ${llama_server_base} ${common_gpu_flags} ${common_nommap}
      -c 131072
      --cache-type-k q8_0 --cache-type-v q8_0
      --jinja --temp 0.6 --min-p 0.0
      -m ${models_dir}/Qwen3.5-9B-UD-Q8_K_XL.gguf
    aliases:
      - "openai/qwen3.5-9b"
      - "local/Qwen3.5-9B"

3. Add to LiteLLM (~/docker/litellm/litellm_config.yaml):

  - model_name: local/Qwen3.5-9B
    litellm_params:
      model: openai/qwen3.5-9b
      api_base: http://host.docker.internal:8080/v1
      api_key: "dummy"
    model_info:
      input_cost_per_token: 0.00000030
      output_cost_per_token: 0.00000120

4. Restart both:

systemctl --user restart llama-swap
cd ~/docker/litellm && docker compose down litellm && docker compose up -d litellm

5. Validate:

curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local/Qwen3.5-9B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'

The request hits LiteLLM on port 4000, which routes to llama-swap on port 8080, which spawns (or reuses) a llama-server process inside the toolbox container. The response flows back through the same chain. From the caller’s perspective, it’s just an OpenAI API call.

Practical Takeaway

This stack evolved from much simpler beginnings. Lavabo was the first attempt — a monolithic Docker container bundling LLMs, embeddings, vision, and TTS into one server. That worked when I had a handful of models. It stopped working when I needed 30+ models across multiple accelerators, hot-swapping on demand, with observability into usage patterns.

The current setup — LiteLLM as proxy, llama-swap as GPU orchestrator, Lemonade as NPU backend — gives me a single OpenAI-compatible endpoint that transparently routes to the right accelerator. Adding a model takes five minutes. Swapping the underlying inference engine is invisible to applications. Usage tracking works because estimated pricing makes the dashboards useful even though the actual cost is just electricity.

Every project I’ve built since — Oneiros, Medium-Claw, and the various voice and TTS experiments — connects to this same infrastructure. They all see http://localhost:4000/v1 and a list of model names. The complexity of toolbox containers, GPU clock management, NPU driver stacks, and model lifecycle orchestration is hidden behind that single endpoint. That’s the whole point.