Testing llama.cpp PR #21344: Faster MoE Prefill, but MTP Fights Back

A comment on llama.cpp PR #21344 caught my attention last week. The PR modifies two CUDA kernel files, fattn-tile.cuh and mmq.cuh, with optimizations targeting GFX1151 (the GPU inside Strix Halo). The reported numbers showed meaningful prefill gains on MoE models, exactly the architecture powering my two heaviest daily-driver models: Qwen3.6-35B-A3B and Gemma 4 26B-A4B.

The question was simple: do these kernel changes help my production workloads? I set up a patched toolbox container, ran llama-bench side-by-side with the stock build, and got a clear answer. Then I asked a harder question: what happens when you combine these kernel optimizations with MTP speculative decoding, which is how I actually run my primary Qwen model? That answer was less convenient.

#The Setup

My inference stack runs llama.cpp inside Podman toolbox containers managed by llama-swap. To test the PR without touching my production baseline, I built a second toolbox image from the pedapudi/llama.cpp:gfx1151-opt branch containing the PR changes.

Component	Specification
CPU/GPU	AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo, GFX1151)
RAM	128 GB LPDDR5X unified memory (~218 GB/s)
Backend	ROCm 7.2.3 (HIP)
Baseline	`llama-rocm-7.2.3`, build 9330 (`328874d05`)
Patched	`llama-rocm-7.2.3-moe-pr21344`, build 9106 (`3f496a6ee`, `pedapudi:gfx1151-opt`)
Models	Qwen3.6-35B-A3B Q8_K_XL, Gemma 4 26B-A4B Q8_K_XL

Both containers ran llama-bench with identical flags (-ngl 99 -fa 1 -mmp 0 -dio 1 -b 2048 -ub 2048) and 3 runs per test. The five workloads cover pure prefill, pure decode, and mixed prompt+generation at increasing context lengths.

#MoE Prefill Gains Are Real

#Qwen3.6-35B-A3B

Test	Stock (t/s)	Patched (t/s)	Delta
pp512	1,083.55	1,339.37	+23.6%
tg128	45.82	45.63	-0.4%
pp2048+tg128	1,267.45	1,311.96	+3.5%
pp8192+tg128	1,156.96	1,272.93	+10.0%
pp32768+tg128	573.40	629.64	+9.8%

#Gemma 4 26B-A4B

Test	Stock (t/s)	Patched (t/s)	Delta
pp512	1,369.67	1,698.92	+24.0%
tg128	41.61	41.63	+0.1%
pp2048+tg128	1,819.94	1,914.33	+5.2%
pp8192+tg128	1,525.56	1,598.49	+4.8%
pp32768+tg128	600.61	599.52	-0.2%

The pattern is broadly consistent across both models. Short prefill (pp512) sees the biggest gains at about +24%. At longer contexts the improvement is smaller and less uniform, ranging from -0.2% to +10.0%, plausibly because attention cost starts to dominate over the matrix multiplication kernels that the PR optimizes. Decode is neutral (within 0.4% either way), which makes sense: token generation is bandwidth-bound, not compute-bound, so faster GEMM kernels don’t help.

Average uplift across the four tests that include a prefill phase: +11.7% for Qwen, +8.5% for Gemma. Those are real wins for any workload that processes large prompts: summarization, RAG, or long-context conversations.
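For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the per-test deltas and the averaged uplift from the table values above. The dictionary layout and function names are my own for illustration, and the average is taken over every test whose name starts with "pp", which is an assumption about how the headline figures were derived.

```python
# Throughput in tokens/s, copied from the two llama-bench tables above
# as (stock, patched) pairs.
RESULTS = {
    "Qwen3.6-35B-A3B": {
        "pp512": (1083.55, 1339.37),
        "tg128": (45.82, 45.63),
        "pp2048+tg128": (1267.45, 1311.96),
        "pp8192+tg128": (1156.96, 1272.93),
        "pp32768+tg128": (573.40, 629.64),
    },
    "Gemma 4 26B-A4B": {
        "pp512": (1369.67, 1698.92),
        "tg128": (41.61, 41.63),
        "pp2048+tg128": (1819.94, 1914.33),
        "pp8192+tg128": (1525.56, 1598.49),
        "pp32768+tg128": (600.61, 599.52),
    },
}


def pct_delta(stock, patched):
    """Relative change of patched over stock, in percent."""
    return (patched - stock) / stock * 100.0


def prefill_uplift(tests):
    """Mean delta over the tests that include a prefill phase (names starting with 'pp')."""
    deltas = [pct_delta(*pair) for name, pair in tests.items() if name.startswith("pp")]
    return sum(deltas) / len(deltas)


uplift = {model: prefill_uplift(tests) for model, tests in RESULTS.items()}

for model, tests in RESULTS.items():
    for name, pair in tests.items():
        print(f"{model:18s} {name:14s} {pct_delta(*pair):+6.1f}%")
    print(f"{model:18s} {'prefill average':14s} {uplift[model]:+6.1f}%")
```

Averaging the unrounded deltas this way gives +11.7% for Qwen and +8.5% for Gemma, matching the figures quoted above.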

#The MTP Complication

The prefill numbers made the PR look like a free upgrade. But my primary Qwen3.6-35B-A3B configuration isn’t a vanilla llama-server instance. It runs with MTP speculative decoding, which is how I get that model from ~46 t/s to ~63 t/s for interactive use. The production entry uses MTP-specific GGUFs with prediction heads baked into the weights, --spec-type draft-mtp, and --spec-draft-n-max 3.

The PR branch was built from an older point in llama.cpp history that predates MTP support entirely. So the first step was surgical: I extracted just the two kernel file changes (fattn-tile.cuh and mmq.cuh) as a patch and applied them on top of upstream llama.cpp master, which does have MTP. That gave me a toolbox container with both features active.
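One way to do this kind of two-file port can be sketched as a small Python wrapper around git. This is an illustration rather than the exact commands I ran: the ref names upstream/master and pr/gfx1151-opt are hypothetical placeholders for however your remotes are named, and the file paths assume the usual ggml/src/ggml-cuda/ layout of the llama.cpp tree.

```python
import subprocess

# Assumed locations of the two kernel headers inside the llama.cpp tree.
KERNEL_FILES = [
    "ggml/src/ggml-cuda/fattn-tile.cuh",
    "ggml/src/ggml-cuda/mmq.cuh",
]


def diff_command(base_ref, pr_ref, files=KERNEL_FILES):
    """Build a `git diff` restricted to the kernel files.

    The three-dot form diffs pr_ref against its merge base with base_ref,
    so only the PR's own changes are captured, not the upstream commits
    the older branch is missing.
    """
    return ["git", "diff", f"{base_ref}...{pr_ref}", "--", *files]


def apply_command():
    """Build a `git apply` that reads the patch from stdin.

    --3way falls back to a three-way merge when the patch does not apply
    cleanly on the newer tree.
    """
    return ["git", "apply", "--3way"]


def port_kernels(repo_dir, base_ref, pr_ref):
    """Extract the kernel-only diff and apply it to the current checkout."""
    patch = subprocess.run(
        diff_command(base_ref, pr_ref),
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    if not patch.strip():
        raise RuntimeError("empty diff: check the ref names and file paths")
    subprocess.run(apply_command(), cwd=repo_dir, input=patch, text=True, check=True)


if __name__ == "__main__":
    # Hypothetical ref names; prints the command instead of running it.
    print(" ".join(diff_command("upstream/master", "pr/gfx1151-opt")))
```

With upstream master checked out, calling port_kernels(".", "upstream/master", "pr/gfx1151-opt") would leave the two headers modified in the working tree, ready for a normal container build.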

Then I ran the real test: stock MTP vs. PR-patched MTP, same model, same flags, same “Pi” profile (my production MTP configuration from the MTP post).

#MTP Benchmark Sweep

Each configuration ran 5 measured passes after 2 warmups, at --spec-draft-n-max values of 1, 2, and 3. There were two test cases: a 512-token “short” prompt, and a “long” case that generates 256 tokens from a longer context.

nmax	Test	Stock MTP (t/s)	PR+MTP (t/s)	Prompt Delta	Decode Delta
1	short	113.90 pp / 63.46 tg	112.11 pp / 61.55 tg	-1.6%	-3.0%
1	long	148.50 pp / 60.13 tg	142.23 pp / 58.36 tg	-4.2%	-2.9%
2	short	113.84 pp / 63.34 tg	110.54 pp / 61.26 tg	-2.9%	-3.3%
2	long	147.92 pp / 60.26 tg	140.51 pp / 57.95 tg	-5.0%	-3.8%
3	short	112.74 pp / 63.26 tg	109.89 pp / 61.67 tg	-2.5%	-2.5%
3	long	147.61 pp / 59.12 tg	139.47 pp / 58.26 tg	-5.5%	-1.4%

Every delta is negative. PR+MTP is slower than stock MTP at every nmax value and both prompt lengths, by 1.4% to 5.5%.
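The delta columns can be recomputed from the raw throughput figures with a short sketch like the one below. The tuple layout and names are mine; the numbers are copied from the table. Because the table values are already rounded, a recomputed delta can differ from the printed one by 0.1 percentage points.

```python
# (nmax, test, stock (pp, tg), patched (pp, tg)) in tokens/s, from the sweep table.
ROWS = [
    (1, "short", (113.90, 63.46), (112.11, 61.55)),
    (1, "long",  (148.50, 60.13), (142.23, 58.36)),
    (2, "short", (113.84, 63.34), (110.54, 61.26)),
    (2, "long",  (147.92, 60.26), (140.51, 57.95)),
    (3, "short", (112.74, 63.26), (109.89, 61.67)),
    (3, "long",  (147.61, 59.12), (139.47, 58.26)),
]


def pct_delta(stock, patched):
    """Relative change of patched over stock, in percent."""
    return (patched - stock) / stock * 100.0


all_deltas = []
for nmax, test, (stock_pp, stock_tg), (pr_pp, pr_tg) in ROWS:
    d_pp = pct_delta(stock_pp, pr_pp)
    d_tg = pct_delta(stock_tg, pr_tg)
    all_deltas += [d_pp, d_tg]
    print(f"nmax={nmax} {test:5s} prompt {d_pp:+.1f}%  decode {d_tg:+.1f}%")

# Smallest and largest regression across all twelve cells.
mildest, worst = max(all_deltas), min(all_deltas)
print(f"range: {mildest:+.1f}% to {worst:+.1f}%")
```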

#Why the Kernel Changes Hurt MTP

The prefill-only benchmark showed the PR’s GEMM kernel changes clearly help compute-bound matrix multiplication. But MTP changes the bottleneck profile in ways that make those same optimizations counterproductive.

MTP speculative decoding runs a tight loop: draft tokens from prediction heads, then verify in a batch. The verification step is a small batched prefill (3-4 tokens at nmax=3), not a 512+ token bulk prefill. At that tiny batch size, the overhead of the PR’s modified kernel dispatch path, optimized for larger tile sizes, likely exceeds any compute savings. The original kernels were presumably better tuned for the small-batch verification passes that dominate MTP’s runtime.

There’s also a scheduling dimension. MTP interleaves draft head forward passes with verification batches on the same GPU. The PR’s kernel changes may alter occupancy or register pressure in ways that are fine for sustained bulk compute but cause pipeline stalls when the GPU rapidly switches between MTP’s draft and verify phases.

The 1.6% to 5.5% drop on prompt processing (which during MTP means the verification batch, not raw prefill) is consistent with this interpretation, though I haven’t profiled the kernels to confirm it. The PR isn’t catastrophic, but it’s friction in a pipeline that’s already been optimized for small-batch latency.

#What I Changed (and Didn’t)

The llama-bench prefill gains are real and useful, so I added the PR-patched models to my llama-swap and LiteLLM configuration as opt-in aliases:

local/Qwen3.6-35B-A3B-PR21344
local/Gemma-4-26B-A4B-IT-PR21344

These use the patched toolbox container with the GFX1151-optimized kernels, but without MTP. Good for workloads that are prefill-heavy: summarization, long document processing, RAG.

My production MTP entries (the “Pi” configurations from the MTP post) stay on the stock llama.cpp build. Adding the PR’s 1.4-5.5% regression on top of MTP isn’t worth it when the stock build already delivers MTP’s full ~37% gain over baseline decode (~46 t/s to ~63 t/s).
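How prefill-heavy does a request need to be before the patched non-MTP alias beats stock MTP? A rough back-of-the-envelope model, under loud assumptions: request time is simply prompt_tokens / prefill rate + generated_tokens / decode rate, the stock MTP build prefills at the stock llama-bench pp512 rate, and the pp512 rates hold for longer prompts too (the tables show they do not, so treat the result as an order-of-magnitude guide). The function names and dictionaries are hypothetical.

```python
def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimated wall time: prefill phase plus decode phase."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Rates in tokens/s taken from the Qwen tables above.
# Assumption: stock MTP prefill is as fast as stock llama-bench pp512.
PATCHED_NO_MTP = {"pp_tps": 1339.37, "tg_tps": 45.63}  # patched pp512 / tg128
STOCK_MTP = {"pp_tps": 1083.55, "tg_tps": 63.26}       # stock pp512 / MTP nmax=3 short


def break_even_ratio(fast_prefill, fast_decode):
    """Prompt-to-generation token ratio at which both backends take equal time.

    Solves P/pp_a + G/tg_a = P/pp_b + G/tg_b for P/G, where a is the backend
    with faster prefill and b is the backend with faster decode.
    """
    decode_penalty = 1 / fast_prefill["tg_tps"] - 1 / fast_decode["tg_tps"]
    prefill_saving = 1 / fast_decode["pp_tps"] - 1 / fast_prefill["pp_tps"]
    return decode_penalty / prefill_saving


ratio = break_even_ratio(PATCHED_NO_MTP, STOCK_MTP)
print(f"patched non-MTP wins when prompt/generated tokens exceeds about {ratio:.0f}")

for prompt, gen in [(512, 256), (8192, 128)]:
    a = request_seconds(prompt, gen, **PATCHED_NO_MTP)
    b = request_seconds(prompt, gen, **STOCK_MTP)
    print(f"{prompt:5d} prompt / {gen:3d} gen: patched {a:.2f}s, stock MTP {b:.2f}s")
```

Under these assumptions the break-even sits at roughly 35 prompt tokens per generated token, which is why the patched aliases only make sense for long-input, short-output jobs while interactive chat stays on MTP.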

#Bottom Line

PR #21344 delivers what it advertises for GFX1151: faster GEMM kernels that speed up compute-bound MoE prefill by up to 24% on short prompts and by roughly 8-12% averaged across context lengths. If you’re running MoE models on Strix Halo without speculative decoding, it’s a straightforward win.

But optimizations don’t always compose. The same kernel changes that help bulk prefill hurt MTP’s small-batch verification loop, by 1.4% to 5.5% across every configuration I tested. If your production setup already uses MTP (and on bandwidth-limited hardware like Strix Halo, it should), applying these kernel patches makes inference slower.

Both configurations now live side-by-side in my llama-swap stack. The router picks the right one based on the model alias: PR-patched for prefill-heavy use, stock for interactive MTP. That’s the practical answer when optimizations conflict: don’t pick one, route around the tradeoff.

#Links

PR #21344: ggml-org/llama.cpp#21344
MTP speculative decoding on Strix Halo: MTP Speculative Decoding: 4.8x Faster Qwen 3.6 27B
Local inference infrastructure: Local LLM Infrastructure on Strix Halo
Toolbox images: kyuz0/amd-strix-halo-toolboxes
Gemma 4 MTP assistant benchmarks: Gemma 4 MTP Assistant on Strix Halo