This is a sequel to two earlier posts: speculative decoding with Gemma 4 E2B (which doubled 31B speed but hurt the MoE model) and MTP with Qwen 3.6 (which achieved a 4.8x speedup using prediction heads baked into the model weights). The question this time: can Google’s official MTP assistant heads, tiny ~0.5B models designed specifically for Gemma 4, do what neither traditional draft models nor weight-baked MTP could do for the full Gemma 4 lineup?

The answer is yes. Gemma 4 31B goes from 6.2 t/s to 22.9 t/s (3.7x). And for the first time, Gemma 4 26B-A4B (MoE) gets a meaningful speedup: 43.6 t/s to 63.2 t/s (+45%), after traditional speculative decoding made it slower.

#How MTP Assistants Differ from Draft Models

In my earlier speculative decoding test, I used the full Gemma 4 E2B (~2B parameters, ~5 GB) as a draft model for the 31B. This worked well for the dense model (+100%) but was a disaster for the 26B-A4B MoE (-27%). The problem was simple: pulling 5 GB of draft weights over the same 218 GB/s memory bus competed with the main model’s own weight reads, and that overhead wiped out any speculation benefit when the main model was already fast.

Google’s MTP assistant is a fundamentally different animal:

  • Tiny: ~0.5B for the 31B assistant, ~0.4B for the 26B-A4B. Only ~310-337 MB on disk at Q4_K_M.
  • No separate KV cache: The assistant reads the target model’s KV cache directly via cross-attention. Zero additional KV memory.
  • No separate context: It’s not a second model in the traditional sense. It’s loaded into the target’s context and uses the target’s last hidden state as input.
  • Async pipeline: Draft compute overlaps with server bookkeeping, so the tiny overhead is partially hidden.

This is implemented in atomic-llama-cpp-turboquant, a llama.cpp fork that adds the gemma4_assistant architecture. It won’t load in upstream llama.cpp; you need this specific fork.
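
To make the data flow concrete, here’s a rough sketch of one speculation round at temperature 0. This is a conceptual illustration with toy stand-ins, not the fork’s actual code; speculate_round, target_next, and assistant_next are made-up names. The real assistant drafts from the target’s last hidden state and KV cache, and the target verifies all drafted positions in one batched forward pass.

# Conceptual sketch only: greedy speculation with longest-prefix acceptance.
def speculate_round(target_next, assistant_next, ctx, block_size):
    # 1. The assistant drafts block_size - 1 tokens cheaply.
    draft = []
    for _ in range(block_size - 1):
        draft.append(assistant_next(ctx + draft))
    # 2. The target predicts the next token at every drafted position
    #    (in the real pipeline this is a single batched verification pass).
    verified = [target_next(ctx + draft[:i]) for i in range(len(draft) + 1)]
    # 3. Keep the longest agreeing prefix, plus the target's own next token for free.
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    accepted.append(verified[len(accepted)])
    return accepted

# Toy demo: the "target" counts upward; the "assistant" is right three times out of four.
target_next = lambda seq: seq[-1] + 1
assistant_next = lambda seq: seq[-1] + 1 if seq[-1] % 4 else seq[-1] + 2
print(speculate_round(target_next, assistant_next, [0, 1, 2], block_size=5))  # -> [3, 4, 5]

Every accepted draft token is one fewer full read of the target’s weights, which is exactly what a bandwidth-bound model needs.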

#The Setup

| Component | Specification |
|---|---|
| CPU/GPU | AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo) |
| RAM | 128 GB LPDDR5X unified memory (~218 GB/s) |
| Backend | Vulkan (Mesa RADV, GFX1151), cooperative matrix enabled |
| Fork | atomic-llama-cpp-turboquant build 8995 |
| Target models | Q8_K_XL (33 GB for 31B, 26 GB for 26B-A4B) |
| Assistant models | Q4_K_M (~337 MB for 31B, ~310 MB for 26B-A4B) |

The assistant GGUFs came from the AtomicChat collection on Hugging Face. Q4_K_M is the recommended quantization. At this model size, bandwidth dominates over weight precision, so throughput is identical to F16 while using 4x less memory.

Building the fork with Vulkan was the same process as any llama.cpp build:

git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant.git ~/atomic-llama-cpp
cd ~/atomic-llama-cpp
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN_COOPMAT=ON
cmake --build build --target llama-server -j$(nproc)

Running with MTP adds two flags to a standard llama-server invocation:

llama-server \
  -m gemma-4-31B-it-UD-Q8_K_XL.gguf \
  --mtp-head gemma-4-31B-it-assistant.Q4_K_M.gguf \
  --spec-type mtp \
  --draft-block-size 5 \
  -fa on -ngl 999 -ngld 99 --no-mmap \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1 --ctx-size 16384

#Benchmark Results

All tests: 3 runs per prompt, 512 generated tokens, temperature 0.0, warmup request before measuring. Common flags: -fa on -ngl 999 -ngld 999 -b 512 -ub 256 --no-mmap --ctx-size 16384 --parallel 1 --cache-type-k q8_0 --cache-type-v q8_0

Two prompts:

  • Fibonacci: “Write a Python program to find the nth Fibonacci number using recursion”
  • MergeSort: “Write a complete merge sort implementation in Python with type hints, error handling, and unit tests”
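
The procedure is simple enough to script. Here’s a minimal harness in that spirit against llama-server’s standard /completion endpoint; the server URL is an assumption, and the timings.predicted_per_second field is what recent llama.cpp server builds report, so adjust if your build differs.

import json
import urllib.request

SERVER = "http://127.0.0.1:8080"  # wherever llama-server is listening
PROMPTS = {
    "Fibonacci": "Write a Python program to find the nth Fibonacci number using recursion",
    "MergeSort": "Write a complete merge sort implementation in Python with type hints, "
                 "error handling, and unit tests",
}

def completion(prompt):
    body = json.dumps({"prompt": prompt, "n_predict": 512, "temperature": 0.0}).encode()
    req = urllib.request.Request(f"{SERVER}/completion", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for name, prompt in PROMPTS.items():
    completion(prompt)  # warmup request, not measured
    runs = [completion(prompt)["timings"]["predicted_per_second"] for _ in range(3)]
    print(f"{name}: {sum(runs) / len(runs):.2f} t/s over 3 runs")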

#Gemma 4 26B-A4B (MoE, ~4B active)

| Config | Fibonacci | MergeSort | vs Baseline |
|---|---|---|---|
| Baseline (no MTP) | 43.60 t/s | 43.58 t/s | - |
| MTP block-size 2 | 53.29 t/s | 52.92 t/s | +22% |
| MTP block-size 3 | 59.17 t/s | 58.80 t/s | +36% |
| MTP block-size 4 | 63.22 t/s | 61.15 t/s | +45% |
| MTP block-size 5 | 61.97 t/s | 61.25 t/s | +42% |
| MTP + turbo3 KV | 21.88 t/s | 21.78 t/s | -50% |

For context, my previous test with the full E2B draft model on this same 26B-A4B MoE gave 29.79 t/s, a 27% regression from baseline. The MTP assistant at 63.2 t/s is more than double what draft-model speculation achieved.

#Gemma 4 31B (Dense)

| Config | Fibonacci | MergeSort | vs Baseline |
|---|---|---|---|
| Baseline (no MTP) | 6.24 t/s | 6.24 t/s | - |
| MTP block-size 2 | 11.60 t/s | 11.02 t/s | +86% |
| MTP block-size 3 | 16.07 t/s | 14.58 t/s | +158% |
| MTP block-size 4 | 19.35 t/s | 17.32 t/s | +210% |
| MTP block-size 5 | 22.85 t/s | 18.89 t/s | +266% |
| MTP + turbo3 KV | 5.72 t/s | 5.62 t/s | -8% |

My previous tests gave 12.3 t/s with E2B draft-model speculation (+100%). The MTP assistant at 22.9 t/s is nearly double that, and 3.7x the raw baseline.

#The Full Speculative Decoding Leaderboard for Strix Halo

Every speculative method I’ve tested on these Gemma 4 models, ranked:

| Model | Method | TG Speed | vs Baseline |
|---|---|---|---|
| 26B-A4B | MTP assistant (bs=4) | 63.2 t/s | +45% |
| 26B-A4B | Baseline | 43.6 t/s | - |
| 26B-A4B | E2B draft model | 29.8 t/s | -27% |
| 31B | MTP assistant (bs=5) | 22.9 t/s | +266% |
| 31B | E2B draft model | 12.3 t/s | +100% |
| 31B | Baseline | 6.2 t/s | - |

#What I Learned Tuning Parameters

I ran a full parameter sweep to find the best configuration. The results were surprisingly simple.

#Block-size is the only knob that matters

--draft-block-size controls how many tokens the assistant predicts per round (B-1 tokens for block-size B). This was the only parameter with a meaningful impact:

  • On 26B-A4B (MoE), performance peaks at block-size 4 then slightly declines. The model is already fast at 44 t/s, so each additional draft step adds diminishing compute savings.
  • On 31B (dense), performance improves monotonically through block-size 5, gaining roughly 3 t/s per step. The model is so bandwidth-bound at 6.2 t/s that every accepted draft token saves an expensive weight read. Higher values would likely still help; the toy cost model after this list sketches why the two models diverge.
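
As an intuition pump only (the acceptance rate and cost ratio below are assumed, not measured), a simple cost model shows why a near-free draft keeps rewarding bigger blocks while a relatively pricier one flattens out:

# Toy cost model, illustrative only (not fitted to the benchmarks above):
# a = probability the target accepts each drafted token
# r = cost of one assistant pass relative to one target forward pass
def expected_speedup(a, r, block_size):
    k = block_size - 1                                # tokens drafted per round
    tokens_per_round = (1 - a ** (k + 1)) / (1 - a)   # accepted prefix + 1 target token
    cost_per_round = 1 + k * r                        # one target pass + k assistant passes
    return tokens_per_round / cost_per_round

# A draft that is nearly free keeps paying off as the block grows; a draft that
# costs a noticeable fraction of an already fast target flattens out sooner.
for label, r in (("very slow target (dense 31B-like)", 0.02),
                 ("already fast target (MoE-like)", 0.10)):
    print(label, [round(expected_speedup(0.8, r, b), 2) for b in range(2, 7)])

The toy ignores per-round fixed overhead and the fact that verifying a bigger batch is not free, which is part of why the real 26B-A4B curve peaks at block-size 4 rather than climbing forever.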

#Everything else is noise

  • draft-max (4, 8, 16): No measurable difference at any setting. Block-size dominates.
  • f16 vs q8_0 KV cache: Functionally identical.
  • Pipeline depth (async depth-2 vs sync): ~0.5% difference. The async overlap benefit documented in the fork’s MTP.md (~8% on M4 Max) is negligible on Vulkan RADV.

#TurboQuant turbo3 is a hard no on Vulkan

The fork also ships TurboQuant, a KV cache compression scheme using Walsh-Hadamard transforms. On the Apple M4 Max (where it was developed), turbo3 KV helps bandwidth-bound models by reducing KV traffic. On my Vulkan RADV setup, it’s catastrophic: -50% on 26B-A4B, -8% on 31B. The TurboFlash decode kernel is Metal-native; Vulkan falls back to a slow reference path. Stick with q8_0.

#Why MTP Assistants Succeed on the MoE Model

This deserves emphasis because it’s the most important result. Previous speculative methods all failed on the 26B-A4B MoE:

  • E2B draft model: Loading 5 GB of draft weights through the same 218 GB/s bus competes with the main model. The MoE only activates ~4B params per token (already fast at 44 t/s), so the draft overhead exceeds the speculation benefit. Net result: -27%.
  • ngram-mod: No token history to pattern-match against for novel generation. No improvement.

The MTP assistant succeeds because it’s architecturally cheap:

  1. ~0.4B vs ~2B: The assistant is 5x smaller than the E2B draft model. Memory bus contention is negligible.
  2. Cross-attention: The assistant reads the target’s KV cache directly. No separate KV allocation, no separate cache management.
  3. Shared hidden state: The assistant consumes the target’s backbone output. It’s essentially a lightweight decoder head attached to the main model, not an independent model.

The result is that draft overhead per round is small enough to be amortized even when the main model is already generating at 44 t/s.
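
A back-of-envelope check using only numbers from this post, assuming draft weights are streamed once per drafted token and ignoring compute, caching, and rejected drafts:

# Rough bandwidth math for the 26B-A4B MoE target (figures from this post).
BUS_GBPS = 218            # GB/s unified memory bandwidth
TARGET_MS = 1000 / 43.6   # ~23 ms per token at the 43.6 t/s baseline

for name, size_gb in (("E2B draft model", 5.0), ("MTP assistant", 0.337)):
    draft_ms = size_gb / BUS_GBPS * 1000  # time just to stream the draft weights once
    print(f"{name}: ~{draft_ms:.1f} ms per drafted token "
          f"(~{draft_ms / TARGET_MS:.0%} of a target token)")

Drafting a token with E2B costs about as much as letting the MoE generate one itself, which is why it regressed; the assistant’s roughly 7% per draft is easy to amortize.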

#Practical Configuration

I’ve added MTP assistant entries to my llama-swap configuration. The key differences from standard entries:

  • Uses the atomic fork binary instead of the toolbox or the am17an MTP fork
  • Adds --mtp-head pointing to the assistant GGUF and --spec-type mtp
  • Sets --draft-block-size 4 for 26B-A4B, --draft-block-size 5 for 31B
  • Drops --mmproj (not tested with MTP) and --cache-ram (incompatible with --parallel 1)
  • Context reduced to 16384 (single-slot MTP is for fast generation, not long-context use)

The MTP entries coexist with the existing multimodal-capable entries and the E2B speculative entries. The router picks the right one based on the model alias.

#How This Compares to Qwen 3.6 MTP

I now have two different MTP approaches running side by side:

|  | Qwen 3.6 (am17an) | Gemma 4 (atomic fork) |
|---|---|---|
| Method | Prediction heads in model weights | Separate assistant GGUF |
| Best speed | 29.8 t/s (4.8x) | 22.9 t/s (3.7x) for 31B, 63.2 t/s (+45%) for 26B |
| Model size impact | Larger GGUFs (heads baked in) | Standard target + tiny assistant |
| Flexibility | Model-specific MTP GGUFs required | Works with any Gemma 4 target GGUF |
| MoE support | N/A (Qwen 3.6 27B is dense) | Yes, first MoE speedup |
| Mainline status | PR #22673 (not merged) | Fork (custom architecture) |

They’re complementary. The am17an approach gives higher gains on Qwen’s dense 27B because the prediction heads are more tightly integrated. The atomic approach works across the entire Gemma 4 family without needing special GGUFs for the target model.

#Bottom Line

If you’re running Gemma 4 on bandwidth-limited hardware, MTP assistants are the best speculative decoding technique available. They beat draft-model speculation on every model, and they’re the first technique that improves MoE throughput. The setup is a clone, build, download, and two extra flags.

For Strix Halo specifically:

  • 26B-A4B: --draft-block-size 4 → 63 t/s (previously stuck at 44 t/s with no viable speedup)
  • 31B: --draft-block-size 5 → 23 t/s (previously 12.3 with E2B draft, 6.2 baseline)