In my previous post on speculative decoding, I showed that traditional draft-model speculation can double the speed of Gemma 4 31B on Strix Halo, but only when the draft model comes from the same family. For Qwen 3.6 27B, there was no viable draft model. Cross-generation drafting (Qwen 3.5 as draft for 3.6) showed no measurable improvement, and draftless ngram decoding did nothing either. The dense 27B model was stuck at 6.2 t/s, hard up against the ~218 GB/s bandwidth wall.

Multi-Token Prediction changes that. MTP is a speculative decoding method in which the prediction heads are baked into the model weights, so no external draft model is required at all. The result: Qwen 3.6 27B goes from 6.2 t/s to 29.8 t/s on my Strix Halo system. That’s a 4.8x speedup on a model that previously had no path to faster inference.

#What Is MTP and Why Is It Different?

Traditional speculative decoding runs two models: a small “draft” model proposes tokens, then the big “target” model verifies them in a batch. This requires the draft and target to share internal representations, not just vocabulary, or acceptance rates tank. For Qwen 3.6, no same-family small model exists (the smallest 3.6 variants are 27B and 35B-A3B), so traditional spec decoding was a dead end.

MTP takes a different approach. During training, the model learns extra “prediction head” layers that forecast future tokens from hidden states. These heads are included in the GGUF file itself, so at inference time, the model can draft its own continuations without an external model. The draft and target are literally the same network, which means acceptance rates are extremely high.
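
If you want to verify that a given GGUF actually carries these prediction heads, the `gguf-dump` script from llama.cpp's `gguf` Python package can list a file's metadata and tensors. The file name below is a placeholder for whatever MTP-converted GGUF you're using, and exact tensor names depend on the conversion, but the `nextn` prediction-head entries (and the `nextn_predict_layers` metadata discussed later) should stand out, whereas a regular GGUF has none.

```bash
# Helper package that ships with llama.cpp (gguf-py); provides the gguf-dump script
pip install gguf

# Dump metadata and tensor names, then filter for the MTP/NextN prediction-head entries.
# A standard Qwen 3.6 27B GGUF should print nothing here; an MTP-converted one should
# show the extra head tensors plus a nextn_predict_layers key in the metadata.
gguf-dump ~/models/Qwen3.6-27B-MTP-Q4_K_M.gguf | grep -i nextn
```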

Here’s the comparison of every speculative method I’ve tested on Qwen 3.6 27B:

| Method | Draft Source | TG Speed | vs Baseline |
|---|---|---|---|
| Baseline (no spec) | | 6.20 t/s | |
| Cross-gen draft (Qwen 3.5 2B) | External model | 6.19 t/s | -0.2% |
| Cross-gen draft (Qwen 3.5 0.8B) | External model | 6.19 t/s | -0.2% |
| ngram-mod | Token history | 6.18 t/s | -0.3% |
| MTP (Q4_K_M, n-max 5) | Built-in heads | 29.79 t/s | +380% |
| MTP (Q8_0, n-max 5) | Built-in heads | 18.70 t/s | +202% |

The first three methods produced rounding-error changes. MTP is in a completely different category.

#The Setup

MTP support isn’t in mainline llama.cpp yet. It’s in PR #22673 by am17an, currently in draft. Regular GGUF files don’t contain the MTP tensor layers either; you need specifically converted GGUFs. I used models from two sources:

Building the MTP branch with Vulkan support was straightforward:

git clone --branch mtp-clean --single-branch https://github.com/am17an/llama.cpp.git ~/llama-cpp-am17an
cd ~/llama-cpp-am17an && mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN_COOPMAT=ON
make -j$(nproc) llama-server

The MTP branch includes a `vulkan: add gdn keep_intermediates=true path` commit, so Vulkan support is first-class, not an afterthought.
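
A quick sanity check after the build: the MTP-specific draft flag should show up in the server's help output. The flag name comes from this branch and may still change before the PR merges.

```bash
cd ~/llama-cpp-am17an/build
./bin/llama-server --help | grep -i "spec-draft"
```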

#Benchmark Results

All tests were run on my Strix Halo system (128 GB LPDDR5X, Radeon 8060S via Vulkan RADV). Each configuration was tested with two prompts: a short coding task (24 prompt tokens, 512 generated tokens) and a longer library design task (97 prompt tokens, 256 generated tokens). Three runs per prompt after a warmup, with results averaged.

Common flags across all tests: `-fa on -ngl 999 -b 512 -ub 256 --no-mmap --ctx-size 100000 --parallel 1`
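
Putting the common flags together with the MTP-specific ones, a full launch command looks roughly like this. The model path and port are placeholders; the q4_0 KV cache types and `--spec-draft-n-max 5` correspond to configuration 3 in the table below.

```bash
~/llama-cpp-am17an/build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -fa on -ngl 999 -b 512 -ub 256 --no-mmap \
  --ctx-size 100000 --parallel 1 \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --spec-draft-n-max 5 \
  --host 127.0.0.1 --port 8080
```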

| # | Config | GGUF Size | KV Cache | draft-n-max | TG (short) | TG (long) | vs Baseline |
|---|---|---|---|---|---|---|---|
| 1 | Baseline (no MTP) | 30 GB (Q8_K_XL) | q8_0 | | 6.20 t/s | 6.22 t/s | |
| 2 | MTP Q4_K_M | 15.4 GB | q4_0 | 2 | 27.23 t/s | 26.27 t/s | 4.4x |
| 3 | MTP Q4_K_M | 15.4 GB | q4_0 | 5 | 29.79 t/s | 26.59 t/s | 4.8x |
| 4 | MTP Q8_0 | 29 GB | q8_0 | 5 | 18.70 t/s | 17.79 t/s | 3.0x |
| 5 | MTP Q8_0 | 29 GB | q4_0 | 5 | 20.51 t/s | 18.26 t/s | 3.3x |

The clear winner for raw speed is configuration 3: Q4_K_M with q4_0 KV cache and --spec-draft-n-max 5, hitting 29.8 t/s on the short prompt. For quality-sensitive work like coding, configuration 4 or 5 (Q8_0 weights) still delivers a 3x speedup at ~19-20 t/s.
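
To spot-check these numbers on your own prompts, you can hit the running server's `/completion` endpoint and read the timing stats back out of the response (assuming your build reports the usual `timings` block; `jq` is only there for readability).

```bash
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a bash script that renames all .txt files in a directory to .md.", "n_predict": 512}' \
  | jq '.timings | {predicted_n, predicted_ms, predicted_per_second}'
```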

#Why Quantization Matters More With MTP

The relationship between model size and MTP speedup is worth understanding. On Strix Halo, inference is entirely bandwidth-bound. The baseline Q8_K_XL model at 30 GB has to stream those weights across a ~218 GB/s memory bus for every generated token, which lands at ~6.2 t/s. MTP lets multiple tokens be verified per forward pass, but each forward pass still costs the same bandwidth.

Smaller quants win twice:

  1. Fewer bytes per forward pass — Q4_K_M at 15.4 GB is roughly half the bandwidth demand of Q8_K_XL, so each forward pass completes faster
  2. MTP multiplies the benefit — if each pass verifies 2-3 tokens instead of 1, the per-token bandwidth cost is divided by the number of accepted tokens per pass, and you’re dividing a smaller number to begin with

This is why Q4_K_M reaches 29.8 t/s (4.8x) while Q8_0 reaches 18.7 t/s (3.0x). The MTP multiplier is similar, but it’s applied to a faster base. On a system with higher bandwidth (like a 3090’s 936 GB/s), the quant difference would be less dramatic.
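
A back-of-envelope version of that argument, treating decode as purely bandwidth-bound and ignoring KV cache traffic, activations, and compute overhead, so take it as a rough sanity check rather than a model:

```bash
# Forward passes per second ~= memory bandwidth / bytes streamed per pass
awk 'BEGIN { printf "Q8_K_XL baseline: %.1f passes/s\n", 218/30   }'  # ~7.3, vs 6.2 t/s observed
awk 'BEGIN { printf "Q4_K_M:           %.1f passes/s\n", 218/15.4 }'  # ~14.2 passes/s
# 29.8 t/s observed at ~14.2 passes/s implies roughly 2.1 accepted tokens per pass
awk 'BEGIN { printf "accepted tokens/pass: %.1f\n", 29.8/14.2 }'
```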

#draft-n-max: 5 vs 2

The Qwen 3.6 27B model has nextn_predict_layers=1, meaning one prediction head producing one extra token per step. Increasing --spec-draft-n-max from 2 to 5 gave a modest improvement (+9% on short prompts, roughly equal on long prompts). This is a smaller gain than what I saw with traditional spec decoding on Gemma 4, where the draft-max sweep had a much more pronounced curve. With MTP, the model’s single prediction head likely limits how many useful tokens each step can draft, so going above 5 probably won’t help.

#Comparison to Other Hardware

| Hardware | Memory BW | Quant | MTP Config | TG Speed |
|---|---|---|---|---|
| Strix Halo | 218 GB/s | Q4_K_M | draft-n-max 5 | 29.8 t/s |
| Strix Halo | 218 GB/s | Q8_0 | draft-n-max 5 | 18.7-20.5 t/s |
| 3090 (CUDA) | 936 GB/s | Q4_K_M | draft-n-max 2 | 50 t/s |
| M2 Max 96 GB | 400 GB/s | Q5_K_M | draft-n-max 5 | 28 t/s |

Strix Halo Q4_K_M performance (29.8 t/s) is comparable to the M2 Max (28 t/s) despite having roughly half the raw bandwidth. The MTP branch’s Vulkan cooperative matrix optimizations likely help close that gap. The 3090 at 50 t/s has over 4x the bandwidth, so its lead is expected.

#Fixed Chat Templates

While setting this up, I also found froggeric’s fixed chat templates for Qwen 3.6, which patch seven bugs in the official template. The important ones for agentic use:

  • Tool calls fail on C++ engines — the official template uses Python-only |items filter
  • developer role rejected — modern APIs send this and the official template crashes
  • Empty thinking blocks spam context — every past turn gets wrapped in tags, even with nothing inside
  • </thinking> hallucination — the model sometimes outputs the wrong closing tag, crashing the parser
  • Unclosed thinking before tool calls — model starts reasoning then calls a tool without closing the block

If you’re using Qwen 3.6 with llama.cpp for agentic workflows, the fixed template is worth grabbing.
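
Applying the fixed template is a launch-flag change. The template file name below is a placeholder for wherever you saved the download; `--jinja` enables llama.cpp's Jinja template path, which the tool-calling machinery relies on.

```bash
~/llama-cpp-am17an/build/bin/llama-server \
  -m ~/models/Qwen3.6-27B-MTP-Q4_K_M.gguf \
  -fa on -ngl 999 --ctx-size 100000 --parallel 1 --spec-draft-n-max 5 \
  --jinja --chat-template-file ~/models/qwen3.6-27b-fixed-template.jinja
```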

#Limitations

  • Draft PR: MTP support is not yet in mainline llama.cpp. I’m running am17an’s mtp-clean branch, built from source with Vulkan
  • --parallel 1 required: MTP only supports single-slot mode, so no concurrent requests through llama-swap
  • No vision: mmproj crashes when used alongside MTP (reported 2026-05-06 in the PR)
  • Separate binary: I’m running the custom-built binary directly on the host rather than through my usual Vulkan toolbox container

These are all solvable once the PR matures and gets merged.

#Practical Takeaway

If you have a Qwen 3.6 27B setup on any bandwidth-constrained hardware, MTP is the biggest inference speedup available. It succeeds where traditional speculative decoding methods completely fail on this model. The setup is a bit involved (building a fork, downloading specially converted GGUFs), but the payoff is real.

For my Strix Halo system, the recommendations are:

| Use case | Config | Expected TG |
|---|---|---|
| Max speed (chat, RAG) | Q4_K_M, q4_0 KV, n-max 5 | ~30 t/s |
| Quality-sensitive (coding) | Q8_0, q8_0 KV, n-max 5 | ~19 t/s |
| Balance | Q8_0, q4_0 KV, n-max 5 | ~20 t/s |

I’ll be watching PR #22673 closely. Once it merges into mainline, this moves from “experiment” to “default configuration” for Qwen 3.6 27B on Strix Halo.