Speculative decoding has been a known technique for a while, but it recently got attention with a post on r/LocalLLaMA showing +29% average speedup (and +50% on code) for Gemma 4 31B using the tiny Gemma 4 E2B as a draft model — on an RTX 5090. The results were compelling enough that I wanted to see what happens on a fundamentally different architecture: my Strix Halo system, where inference is memory-bandwidth bound rather than compute bound.

The short version: speculative decoding works, and it works better than I expected. But the optimal configuration is completely different from what works on discrete GPUs.

What Is Speculative Decoding?

During normal autoregressive generation, the model produces one token at a time. Each token requires reading the full model weights through memory — for a 31B Q8 model, that’s ~34 GB per token. On Strix Halo with ~218 GB/s memory bandwidth, that math alone caps you around 6 t/s.
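That back-of-the-envelope bound is easy to reproduce. A quick sketch using the numbers above (the ~34 GB weight size and ~218 GB/s bandwidth are the figures from this post):

```python
# Bandwidth-bound decode ceiling: each generated token must stream the
# full model weights from memory once, so bandwidth / weight size bounds
# the tokens-per-second rate.
model_bytes = 34e9   # ~34 GB, the 31B model at Q8
bandwidth = 218e9    # ~218 GB/s, Strix Halo unified LPDDR5X

ceiling_tps = bandwidth / model_bytes
print(f"theoretical ceiling: {ceiling_tps:.1f} t/s")  # ~6.4 t/s
```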

Speculative decoding changes the loop. A small, fast “draft” model generates several candidate tokens ahead. The main model then verifies all of them in a single batch (which is much cheaper than generating them one by one, since batch processing reuses the same weight reads). Any tokens the main model agrees with are accepted for free. Rejected tokens are discarded and the main model’s output is used instead.

The key insight: this is lossless. The target model always has final say, so output quality is identical to running without speculative decoding. You’re trading extra compute for fewer sequential memory reads.
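The loop can be sketched in a few lines. This is a toy illustration with stand-in "models" (plain deterministic functions), not llama.cpp's implementation, and it decodes greedily — real implementations use a rejection-sampling rule so that sampled outputs are also distributionally identical. The point is the control flow and the losslessness guarantee:

```python
# Toy sketch of the speculative decoding loop.
# Both "models" are stand-in functions mapping a context to a next token.

def target_next(ctx):
    # Stand-in for the big target model (produces one token).
    return (sum(ctx) * 31) % 100

def draft_next(ctx):
    # Stand-in for the small draft model: mostly agrees with the target,
    # sometimes diverges.
    s = sum(ctx)
    return (s * 31) % 100 if s % 3 else (s * 7) % 100

def speculative_step(ctx, draft_max=4):
    # 1. Draft model proposes up to draft_max tokens autoregressively.
    proposed = []
    for _ in range(draft_max):
        proposed.append(draft_next(ctx + proposed))
    # 2. Target model verifies the proposals. Here this is emulated
    #    token by token; a real backend scores the whole batch in one
    #    forward pass, reusing the same weight reads.
    accepted = []
    for tok in proposed:
        t = target_next(ctx + accepted)
        if t == tok:
            accepted.append(t)   # draft token accepted "for free"
        else:
            accepted.append(t)   # rejection: keep the target's token
            break
    else:
        # Every draft matched: the verification pass also yields one
        # bonus token from the target.
        accepted.append(target_next(ctx + accepted))
    return accepted
```

By construction every emitted token is exactly what the target model would have produced on its own — the draft only changes how many sequential weight reads it takes to get there.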

The Setup

| Component | Specification |
|---|---|
| CPU/GPU | AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo) |
| RAM | 128 GB LPDDR5X unified memory (~218 GB/s) |
| Backend | Vulkan (Mesa RADV), llama-swap → llama-server via toolbox |
| Main model | gemma-4-31B-it-UD-Q8_K_XL.gguf (~34 GB) |
| Draft model | gemma-4-E2B-it-UD-Q8_K_XL.gguf (~5 GB) |
| Flags | `-fa on -ngl 999 -b 512 -ub 256 --no-mmap --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1` |

The E2B model is Gemma 4’s smallest variant at ~2B effective parameters. It shares the same architecture and vocabulary as the 31B, which means llama.cpp doesn’t need to translate tokens between models — a critical detail, since vocab translation mode kills all performance gains.

Why Strix Halo Is Different

On a discrete GPU like the RTX 5090, you have ~1.8 TB/s of dedicated VRAM bandwidth. The GPU is compute-rich relative to memory bandwidth, so verifying a longer batch of draft tokens is relatively cheap. The Reddit post found --draft-max 8 as the sweet spot, with higher values still beneficial for structured tasks.

On Strix Halo, both models share the same ~218 GB/s unified memory bus. Every byte the draft model reads competes with the main model. Longer draft sequences mean the verification batch is larger, consuming more bandwidth. This means the optimal draft length should be shorter — the compute savings of batch verification have to outweigh the bandwidth cost of running both models through the same pipe.

Benchmark Results

Five prompt types, 500 max tokens, temperature 0.7, warmup query discarded before measuring.

Baseline (No Speculative Decoding)

| Query Type | PP (t/s) | TG (t/s) |
|---|---|---|
| Math | 60.6 | 6.17 |
| Code | 57.6 | 6.17 |
| Science | 66.7 | 6.17 |
| Creative | 56.8 | 6.21 |
| Translation | 76.9 | 6.17 |
| Average | 63.7 | 6.18 |

The baseline TG is rock-solid at ~6.2 t/s regardless of content type. This is pure bandwidth math: 218 GB/s ÷ 34 GB of weights ≈ 6.4 t/s theoretical, and we’re hitting ~96% of that ceiling.

draft-max Sweep

| draft-max | Math | Code | Science | Creative | Translation | Avg (t/s) | vs Baseline |
|---|---|---|---|---|---|---|---|
| baseline | 6.17 | 6.17 | 6.17 | 6.21 | 6.17 | 6.18 | |
| 2 | 12.52 | 11.85 | 10.10 | 8.70 | 9.80 | 10.59 | +71% |
| 4 | 16.64 | 14.74 | 12.09 | 8.28 | 11.04 | 12.55 | +103% |
| 8 | 10.66 | 10.15 | 8.74 | 6.56 | 9.22 | 9.06 | +47% |
| 16 | 13.83 | 10.94 | 8.28 | 8.77 | 9.19 | 10.20 | +65% |
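As a sanity check, the "vs Baseline" percentages follow directly from the averages:

```python
# Reproducing the "vs Baseline" column from the sweep averages.
baseline = 6.18  # average TG (t/s) without speculative decoding
averages = {2: 10.59, 4: 12.55, 8: 9.06, 16: 10.20}

for draft_max, avg in averages.items():
    speedup = (avg / baseline - 1) * 100
    print(f"draft-max {draft_max:>2}: +{speedup:.0f}%")
```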

Per-Task Best Results

| Query Type | Best Speed | Best draft-max | Speedup |
|---|---|---|---|
| Math | 16.64 t/s | 4 | +170% |
| Code | 14.74 t/s | 4 | +139% |
| Science | 12.09 t/s | 4 | +96% |
| Translation | 11.04 t/s | 4 | +79% |
| Creative | 8.70 t/s | 2 | +40% |

The Sweet Spot Is draft-max 4

This is the most interesting finding. On the RTX 5090, --draft-max 8 was optimal and 16 pushed math to 99 t/s. On Strix Halo, the curve peaks at --draft-max 4 and then regresses sharply at 8 before partially recovering at 16.

The pattern makes sense when you think about bandwidth pressure:

  • draft-max 2: Each speculative step drafts 2 tokens and verifies 3 (draft + 1 correction) in one batch. Low overhead, moderate savings. +71%.
  • draft-max 4: The sweet spot. The verification batch (up to 5 tokens) is still efficient on the bandwidth-limited bus, and acceptance rates are high enough that most drafts save multiple sequential decode steps. +103%.
  • draft-max 8: The verification batch gets expensive. Even with 35-62% acceptance rates, the bandwidth cost of verifying 9 tokens in a batch (reading 34 GB of weights for a larger batch) eats into the savings. +47%.
  • draft-max 16: Similar story but with occasional wins on highly predictable sequences (math at 13.8 t/s). The average barely beats draft-max 2.

On a discrete GPU with 8x the bandwidth, the verification cost is trivial and longer drafts are pure upside. On shared memory, shorter but more frequent speculation wins.
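This trade can be made concrete with a toy cost model. Everything below except the weight sizes and bandwidth is my own assumption, chosen for illustration — the per-token batch overhead and the 0.8 acceptance rate are not fitted to the measurements above:

```python
# Toy cost model for picking draft-max on a shared memory bus.
TARGET_GB = 34         # bytes streamed per verification pass (31B @ Q8)
DRAFT_GB = 5           # bytes streamed per drafted token (E2B @ Q8)
BW = 218               # GB/s of shared bandwidth
BATCH_OVERHEAD = 0.02  # ASSUMED extra verify cost per batched token

def tokens_per_second(k, a):
    """Throughput for draft length k, i.i.d. per-token acceptance a."""
    # Expected tokens produced per step: accepted drafts + 1 correction.
    produced = (1 - a ** (k + 1)) / (1 - a)
    # Step time: k sequential draft passes plus one slightly larger
    # verification pass, all competing for the same bus.
    step_s = (k * DRAFT_GB + TARGET_GB * (1 + BATCH_OVERHEAD * k)) / BW
    return produced / step_s

for k in (2, 4, 8, 16):
    print(f"draft-max {k:>2}: {tokens_per_second(k, a=0.8):.1f} t/s")
```

With these assumed constants the curve peaks at draft-max 4 and falls off at 8 and 16, qualitatively matching the sweep: each extra drafted token costs real bandwidth, while the marginal chance of it being accepted shrinks geometrically.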

Draft Acceptance Rates

From the server logs during the --draft-max 8 run:

| Request | Acceptance Rate |
|---|---|
| Warmup | 38.0% (19/50) |
| Math | 61.7% (436/707) |
| Code | 47.9% (406/847) |
| Science | 34.6% (315/910) |
| Creative | 34.7% (102/294) |
| Translation | 40.9% (316/773) |

Math and code have the highest acceptance rates because their output is more structured and predictable. Creative writing is the hardest for the draft model to predict, which is why it sees the smallest speedup regardless of draft-max.
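Assuming each of the k drafted tokens is accepted independently with probability a (a simplification — real acceptance is position-dependent), the expected number of drafts accepted per verification is a(1 − a^k)/(1 − a), which makes the spread concrete:

```python
# Expected draft tokens accepted per verification, under an i.i.d.
# acceptance model (a simplification of real, position-dependent
# acceptance behavior).
def expected_accepted(a, k=8):
    return a * (1 - a ** k) / (1 - a)

# Acceptance rates from the --draft-max 8 logs above.
for task, a in [("math", 0.617), ("code", 0.479), ("creative", 0.347)]:
    print(f"{task:<8} {expected_accepted(a):.2f} accepted/step")
```

Under this model, a math request lands roughly three times as many accepted draft tokens per step as creative writing (~1.6 vs ~0.5), which lines up with the spread in speedups.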

Trade-offs

There are real trade-offs with this approach:

  • No vision/multimodal: llama.cpp explicitly blocks speculative decoding when --mmproj is loaded. You can’t use image input with the speculative variants.
  • Single request only: --parallel 1 is mandatory. The draft model’s KV cache scales with parallel slots, and at higher parallelism it both uses more memory and tanks throughput.
  • Extra memory: The E2B draft model at Q8_K_XL adds ~5 GB. At Q4 it would be ~3 GB with only ~3.5% less speedup according to the Reddit benchmarks. On Strix Halo’s 128 GB this is negligible, but it matters on tighter systems.
  • Prompt processing unchanged: PP speed is identical with or without speculative decoding — it only affects token generation.

Configuration Changes

I added speculative decoding variants to my llama-swap config as separate model entries. The original multimodal-capable entries stay unchanged — I route to the spec variants when I want faster text generation and don’t need image input.

A new macro handles the draft model configuration:

"gemma4_spec_draft": "-md ${models_dir}/gemma-4-E2B-it-UD-Q8_K_XL.gguf -ngld 999 --draft-max 4 --draft-min 1 --parallel 1"

Then each speculative model entry drops --mmproj and adds ${gemma4_spec_draft}:

"gemma-4-31b-it-spec":
  cmd: >
    ${llama_server_base}
    -m ${models_dir}/gemma-4-31B-it-UD-Q8_K_XL.gguf
    ${gemma4_spec_draft}
    ${common_gpu_flags} ${common_nommap}
    --ctx-size 262144
    --cache-type-k q8_0 --cache-type-v q8_0
    --chat-template-file ${gemma4_template}
    --chat-template-kwargs "{\"enable_thinking\":true}"
    --temp 1.0 --top-p 0.95 --top-k 64 --repeat-penalty 1.0

I added four variants total: thinking and non-thinking for both the 31B dense and the 26B-A4B MoE. All are accessible through llama-swap and LiteLLM as openai/gemma-4-31b-it-spec, openai/gemma-4-31b-it-spec-nonthinking, and the 26B equivalents.

Strix Halo vs Discrete GPU: Different Optimization Strategies

This experiment highlights something important about Strix Halo inference tuning. Techniques developed on NVIDIA cards often need recalibration for unified memory architectures:

| Parameter | RTX 5090 (discrete) | Strix Halo (unified) |
|---|---|---|
| Optimal draft-max | 8 | 4 |
| Best avg speedup | +29% | +103% |
| Math peak | 99 t/s | 16.6 t/s |
| Bandwidth | ~1.8 TB/s GDDR7 | ~218 GB/s LPDDR5X |
| Bottleneck | Compute → bandwidth | Bandwidth only |

The larger percentage improvement on Strix Halo makes sense. The baseline is much lower (6.2 vs 57 t/s), so each saved decode step is a proportionally bigger win. The absolute numbers aren’t comparable — 16.6 t/s vs 99 t/s — but doubling your token generation speed on an iGPU with a one-line config change is a meaningful quality-of-life improvement.

Practical Takeaway

If you’re running Gemma 4 31B on Strix Halo (or any bandwidth-constrained unified memory system), speculative decoding with E2B as the draft model is the single biggest throughput improvement available — bigger than the GPU clock fix, bigger than KV cache quantization, bigger than batch tuning. Use --draft-max 4 instead of the 8 that discrete GPUs prefer, and keep --parallel 1. The trade-off is losing multimodal support on those endpoints, which is easy to solve by maintaining separate model entries in llama-swap.