This is a sequel to two earlier posts: speculative decoding with Gemma 4 E2B (which doubled 31B speed but hurt the MoE model) and MTP with Qwen 3.6 (which achieved a 4.8x speedup using prediction heads baked into the model weights). The question this time: can Google’s official MTP assistant heads, tiny ~0.5B models designed specifically for Gemma 4, do what neither traditional draft models nor weight-baked MTP could do for the full Gemma 4 lineup?

The answer is yes. Gemma 4 31B goes from 6.2 t/s to 22.9 t/s (3.7x). And for the first time, Gemma 4 26B-A4B (MoE) gets a meaningful speedup: 43.6 t/s to 63.2 t/s (+45%), after traditional speculative decoding made it slower.

#How MTP Assistants Differ from Draft Models

In my earlier speculative decoding test, I used the full Gemma 4 E2B (~2B parameters, ~5 GB) as a draft model for the 31B. This worked well for the dense model (+100%) but was a disaster for the 26B-A4B MoE (-27%). The problem was simple: pulling 5 GB of draft weights over the same 218 GB/s memory bus competed with the main model’s own weight reads, and that overhead wiped out any speculation benefit when the main model was already fast.

Google’s MTP assistant is a fundamentally different animal:

  • Tiny: ~0.5B for the 31B assistant, ~0.4B for the 26B-A4B. Only ~310-337 MB on disk at Q4_K_M.
  • No separate KV cache: The assistant reads the target model’s KV cache directly via cross-attention. Zero additional KV memory.
  • No separate context: It’s not a second model in the traditional sense. It’s loaded into the target’s context and uses the target’s last hidden state as input.
  • Async pipeline: Draft compute overlaps with server bookkeeping, so the tiny overhead is partially hidden.

This is implemented in atomic-llama-cpp-turboquant, a llama.cpp fork that adds the gemma4_assistant architecture. It won’t load in upstream llama.cpp; you need this specific fork.
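
To make the data flow concrete, here’s a rough sketch of one speculation round at temperature 0. This is a conceptual illustration with toy stand-ins, not the fork’s actual code; speculate_round, target_next, and assistant_next are made-up names. The real assistant drafts from the target’s last hidden state and KV cache, and the target verifies all drafted positions in one batched forward pass.

# Conceptual sketch only: greedy speculation with longest-prefix acceptance.
def speculate_round(target_next, assistant_next, ctx, block_size):
    # 1. The assistant drafts block_size - 1 tokens cheaply.
    draft = []
    for _ in range(block_size - 1):
        draft.append(assistant_next(ctx + draft))
    # 2. The target predicts the next token at every drafted position
    #    (in the real pipeline this is a single batched verification pass).
    verified = [target_next(ctx + draft[:i]) for i in range(len(draft) + 1)]
    # 3. Keep the longest agreeing prefix, plus the target's own next token for free.
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(verified[len(accepted)])
    return accepted

# Toy demo: the "target" counts upward; the "assistant" is right three times out of four.
target_next = lambda seq: seq[-1] + 1
assistant_next = lambda seq: seq[-1] + 1 if seq[-1] % 4 else seq[-1] + 2
print(speculate_round(target_next, assistant_next, [0, 1, 2], block_size=5))  # -> [3, 4, 5]

Every accepted draft token is one fewer full read of the target’s weights, which is exactly what a bandwidth-bound model needs.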

#The Setup

| Component | Specification |
|---|---|
| CPU/GPU | AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo) |
| RAM | 128 GB LPDDR5X unified memory (~218 GB/s) |
| Backend | Vulkan (Mesa RADV, GFX1151), cooperative matrix enabled |
| Fork | atomic-llama-cpp-turboquant build 8995 |
| Target models | Q8_K_XL (33 GB for 31B, 26 GB for 26B-A4B) |
| Assistant models | Q4_K_M (~337 MB for 31B, ~310 MB for 26B-A4B) |

The assistant GGUFs came from the AtomicChat collection on Hugging Face. Q4_K_M is the recommended quantization. At this model size, bandwidth dominates over weight precision, so throughput is identical to F16 while using 4x less memory.

Building the fork with Vulkan was the same process as any llama.cpp build:

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git ~/atomic-llama-cpp
cd ~/atomic-llama-cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN_COOPMAT=ON
cmake --build build --target llama-server -j$(nproc)

Running with MTP adds two flags to a standard llama-server invocation:

llama-server \
  -m gemma-4-31B-it-UD-Q8_K_XL.gguf \
  --mtp-head gemma-4-31B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 5 \
  -fa on -ngl 999 -ngld 99 --no-mmap \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1 --ctx-size 16384

#Benchmark Results

All tests: 3 runs per prompt, 512 generated tokens, temperature 0.0, warmup request before measuring. Common flags: -fa on -ngl 999 -ngld 999 -b 512 -ub 256 --no-mmap --ctx-size 16384 --parallel 1 --cache-type-k q8_0 --cache-type-v q8_0

Two prompts:

  • Fibonacci: “Write a Python program to find the nth Fibonacci number using recursion”
  • MergeSort: “Write a complete merge sort implementation in Python with type hints, error handling, and unit tests”
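
The procedure is simple enough to script. Here’s a minimal harness in that spirit against llama-server’s standard /completion endpoint; the server URL is an assumption, and the timings.predicted_per_second field is what recent llama.cpp server builds report, so adjust if your build differs.

import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # wherever llama-server is listening
PROMPTS = {
    "Fibonacci": "Write a Python program to find the nth Fibonacci number using recursion",
    "MergeSort": "Write a complete merge sort implementation in Python with type hints, "
                 "error handling, and unit tests",
}

def completion(prompt):
    body = json.dumps({"prompt": prompt, "n_predict": 512, "temperature": 0.0}).encode()
    req = urllib.request.Request(f"{SERVER}/completion", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for name, prompt in PROMPTS.items():
    completion(prompt)  # warmup request, not measured
    runs = [completion(prompt)["timings"]["predicted_per_second"] for _ in range(3)]
    print(f"{name}: {sum(runs) / len(runs):.2f} t/s over 3 runs")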

#Gemma 4 26B-A4B (MoE, ~4B active)

| Config | Fibonacci | MergeSort | vs Baseline |
|---|---|---|---|
| Baseline (no MTP) | 43.60 t/s | 43.58 t/s | - |
| MTP block-size 2 | 53.29 t/s | 52.92 t/s | +22% |
| MTP block-size 3 | 59.17 t/s | 58.80 t/s | +36% |
| MTP block-size 4 | 63.22 t/s | 61.15 t/s | +45% |
| MTP block-size 5 | 61.97 t/s | 61.25 t/s | +42% |
| MTP + turbo3 KV | 21.88 t/s | 21.78 t/s | -50% |

For context, my previous test with the full E2B draft model on this same 26B-A4B MoE gave 29.79 t/s, a 27% regression from baseline. The MTP assistant at 63.2 t/s is more than double what draft-model speculation achieved.

#Gemma 4 31B (Dense)

| Config | Fibonacci | MergeSort | vs Baseline |
|---|---|---|---|
| Baseline (no MTP) | 6.24 t/s | 6.24 t/s | - |
| MTP block-size 2 | 11.60 t/s | 11.02 t/s | +86% |
| MTP block-size 3 | 16.07 t/s | 14.58 t/s | +158% |
| MTP block-size 4 | 19.35 t/s | 17.32 t/s | +210% |
| MTP block-size 5 | 22.85 t/s | 18.89 t/s | +266% |
| MTP + turbo3 KV | 5.72 t/s | 5.62 t/s | -8% |

My previous tests gave 12.3 t/s with E2B draft-model speculation (+100%). The MTP assistant at 22.9 t/s is nearly double that, and 3.7x the raw baseline.

#The Full Speculative Decoding Leaderboard for Strix Halo

Every speculative method I’ve tested on these Gemma 4 models, ranked:

| Model | Method | TG Speed | vs Baseline |
|---|---|---|---|
| 26B-A4B | MTP assistant (bs=4) | 63.2 t/s | +45% |
| 26B-A4B | Baseline | 43.6 t/s | - |
| 26B-A4B | E2B draft model | 29.8 t/s | -27% |
| 31B | MTP assistant (bs=5) | 22.9 t/s | +266% |
| 31B | E2B draft model | 12.3 t/s | +100% |
| 31B | Baseline | 6.2 t/s | - |

#What I Learned Tuning Parameters

I ran a full parameter sweep to find the best configuration. The results were surprisingly simple.

#Block-size is the only knob that matters

--draft-block-size controls how many tokens the assistant predicts per round (B-1 tokens for block-size B). This was the only parameter with a meaningful impact:

  • On 26B-A4B (MoE), performance peaks at block-size 4 then slightly declines. The model is already fast at 44 t/s, so each additional draft step adds diminishing compute savings.
  • On 31B (dense), performance improves monotonically through block-size 5, gaining roughly 3 t/s per step. The model is so bandwidth-bound at 6.2 t/s that every accepted draft token saves an expensive weight read. Higher values would likely still help; the toy cost model after this list sketches why the two models diverge.
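
As an intuition pump only (the acceptance rate and cost ratio below are assumed, not measured), a simple cost model shows why a near-free draft keeps rewarding bigger blocks while a relatively pricier one flattens out:

# Toy cost model, illustrative only (not fitted to the benchmarks above):
# a = probability the target accepts each drafted token
# r = cost of one assistant pass relative to one target forward pass
def expected_speedup(a, r, block_size):
    k = block_size - 1                                # tokens drafted per round
    tokens_per_round = (1 - a ** (k + 1)) / (1 - a)   # accepted prefix + 1 target token
    cost_per_round = 1 + k * r                        # one target pass + k assistant passes
    return tokens_per_round / cost_per_round

# A draft that is nearly free keeps paying off as the block grows; a draft that
# costs a noticeable fraction of an already fast target flattens out sooner.
for label, r in (("very slow target (dense 31B-like)", 0.02),
                 ("already fast target (MoE-like)", 0.10)):
    print(label, [round(expected_speedup(0.8, r, b), 2) for b in range(2, 7)])

The toy ignores per-round fixed overhead and the fact that verifying a bigger batch is not free, which is part of why the real 26B-A4B curve peaks at block-size 4 rather than climbing forever.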

#Everything else is noise

  • draft-max (4, 8, 16): No measurable difference at any setting. Block-size dominates.
  • f16 vs q8_0 KV cache: Functionally identical.
  • Pipeline depth (async depth-2 vs sync): ~0.5% difference. The async overlap benefit documented in the fork’s MTP.md (~8% on M4 Max) is negligible on Vulkan RADV.

#TurboQuant turbo3 is a hard no on Vulkan

The fork also ships TurboQuant, a KV cache compression scheme using Walsh-Hadamard transforms. On the Apple M4 Max (where it was developed), turbo3 KV helps bandwidth-bound models by reducing KV traffic. On my Vulkan RADV setup, it’s catastrophic: -50% on 26B-A4B, -8% on 31B. The TurboFlash decode kernel is Metal-native; Vulkan falls back to a slow reference path. Stick with q8_0.

#Why MTP Assistants Succeed on the MoE Model

This deserves emphasis because it’s the most important result. Previous speculative methods all failed on the 26B-A4B MoE:

  • E2B draft model: Loading 5 GB of draft weights through the same 218 GB/s bus competes with the main model. The MoE only activates ~4B params per token (already fast at 44 t/s), so the draft overhead exceeds the speculation benefit. Net result: -27%.
  • ngram-mod: No token history to pattern-match against for novel generation. No improvement.

The MTP assistant succeeds because it’s architecturally cheap:

  1. ~0.4B vs ~2B: The assistant is 5x smaller than the E2B draft model. Memory bus contention is negligible.
  2. Cross-attention: The assistant reads the target’s KV cache directly. No separate KV allocation, no separate cache management.
  3. Shared hidden state: The assistant consumes the target’s backbone output. It’s essentially a lightweight decoder head attached to the main model, not an independent model.

The result is that draft overhead per round is small enough to be amortized even when the main model is already generating at 44 t/s.
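
A back-of-envelope check using only numbers from this post, assuming draft weights are streamed once per drafted token and ignoring compute, caching, and rejected drafts:

# Rough bandwidth math for the 26B-A4B MoE target (figures from this post).
BUS_GBPS = 218            # GB/s unified memory bandwidth
TARGET_MS = 1000 / 43.6   # ~23 ms per token at the 43.6 t/s baseline

for name, size_gb in (("E2B draft model", 5.0), ("MTP assistant", 0.337)):
    draft_ms = size_gb / BUS_GBPS * 1000  # time just to stream the draft weights once
    print(f"{name}: ~{draft_ms:.1f} ms per drafted token "
          f"(~{draft_ms / TARGET_MS:.0%} of a target token)")

Drafting a token with E2B costs about as much as letting the MoE generate one itself, which is why it regressed; the assistant’s roughly 7% per draft is easy to amortize.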

#Practical Configuration

I’ve added MTP assistant entries to my llama-swap configuration. The key differences from standard entries:

  • Uses the atomic fork binary instead of the toolbox or the am17an MTP fork
  • Adds --mtp-head pointing to the assistant GGUF and --spec-type mtp
  • Sets --draft-block-size 4 for 26B-A4B, --draft-block-size 5 for 31B
  • Drops --mmproj (not tested with MTP) and --cache-ram (incompatible with --parallel 1)
  • Context reduced to 16384 (single-slot MTP is for fast generation, not long-context use)

The MTP entries coexist with the existing multimodal-capable entries and the E2B speculative entries. The router picks the right one based on the model alias.

#How This Compares to Qwen 3.6 MTP

I now have two different MTP approaches running side by side:

|  | Qwen 3.6 (am17an) | Gemma 4 (atomic fork) |
|---|---|---|
| Method | Prediction heads in model weights | Separate assistant GGUF |
| Best speed | 29.8 t/s (4.8x) | 22.9 t/s (3.7x) for 31B, 63.2 t/s (+45%) for 26B |
| Model size impact | Larger GGUFs (heads baked in) | Standard target + tiny assistant |
| Flexibility | Model-specific MTP GGUFs required | Works with any Gemma 4 target GGUF |
| MoE support | N/A (Qwen 3.6 27B is dense) | Yes, first MoE speedup |
| Mainline status | PR #22673 (not merged) | Fork (custom architecture) |

They’re complementary. The am17an approach gives higher gains on Qwen’s dense 27B because the prediction heads are more tightly integrated. The atomic approach works across the entire Gemma 4 family without needing special GGUFs for the target model.

#Bottom Line

If you’re running Gemma 4 on bandwidth-limited hardware, MTP assistants are the best speculative decoding technique available. They beat draft-model speculation on every model, and they’re the first technique that improves MoE throughput. The setup is a clone, build, download, and two extra flags.

For Strix Halo specifically:

  • 26B-A4B: --draft-block-size 4 → 63 t/s (previously stuck at 44 t/s with no viable speedup)
  • 31B: --draft-block-size 5 → 23 t/s (previously 12.3 with E2B draft, 6.2 baseline)