BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

BeeLlama.cpp is a performance-focused llama.cpp fork that adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection. The headline feature is DFlash, where a small draft model cross-attends to the target model’s hidden states and proposes tokens for batch verification. On NVIDIA hardware, the project’s benchmarks show 3-5x speedups on structured generation tasks. I wanted to see what it does on Strix Halo’s bandwidth-limited unified memory, and how it stacks up against the MTP approaches I’m already running.

The short version: DFlash gives a real 2-2.7x speedup over baseline llama.cpp on dense models. That’s a meaningful improvement. But my existing MTP setups, both weight-baked MTP for Qwen 3.6 and MTP assistants for Gemma 4, are still 23-67% faster than DFlash in a strict apples-to-apples comparison.

#What Is DFlash?

DFlash is a speculative decoding variant where a purpose-built draft model reads the target model’s hidden states through a cross-attention ring buffer, then proposes multiple tokens ahead. The target verifies them in a single forward pass, accepting correct predictions for free. This is fundamentally different from the MTP approaches I’ve been using:

MTP (weight-baked): Prediction heads are part of the model weights themselves. No external draft model needed, but requires specially converted MTP GGUFs.
MTP (assistant): A tiny (~0.5B) assistant model shares the target’s KV cache via cross-attention. Requires a fork-specific assistant GGUF.
DFlash: A separate draft GGUF with DFlash-specific metadata and cross-attention weights. The draft model is small (~1 GB) but fully separate from the target.

All three are lossless, the target model always has final say on output.

#Setup

Component	Specification
CPU/GPU	AMD Ryzen AI MAX+ 395 / Radeon 8060S (Strix Halo)
RAM	128 GB LPDDR5X unified memory (~218 GB/s)
Backend	Vulkan (Mesa RADV, GFX1151), cooperative matrix enabled
BeeLlama.cpp	Built from source with `-DGGML_VULKAN=ON -DGGML_NATIVE=ON -DGGML_VULKAN_COOPMAT=ON`
Baseline	toolbox `llama-vulkan-radv` (stock llama.cpp, Vulkan)
MTP baseline	toolbox `llama-rocm-7.2.3` with `--spec-type draft-mtp`

#Build Notes

Building BeeLlama on Fedora required installing SPIRV-Headers locally first, since the Vulkan backend has a hard CMake dependency on find_package(SPIRV-Headers) that Fedora’s Vulkan packages don’t satisfy. After cloning SPIRV-Headers to ~/.local, the build went through cleanly with -DCMAKE_PREFIX_PATH and explicit -I flags.

#DFlash Draft Models

This tripped me up initially. DFlash requires purpose-built draft GGUFs with cross-attention metadata baked in. You cannot use a regular small model from the same family (like Gemma 4 E2B) or an MTP assistant GGUF. Attempting to load a standard model as a DFlash drafter fails with:

draft model is not a valid DFlash drafter: missing complete DFlash metadata

The DFlash draft GGUFs are published by the project author:

Anbeeld/gemma-4-31B-it-DFlash-GGUF (IQ4_XS, ~836 MB)
Anbeeld/Qwen3.6-27B-DFlash-GGUF (Q4_K_M, ~1 GB)

#Platform Caveat

BeeLlama’s docs are upfront about this: Vulkan is “not recommended for DFlash.” The GPU cross-attention ring buffer that makes DFlash fast on CUDA falls back to a CPU ring path on Vulkan. TurboQuant cache types are also unavailable on Vulkan. So these results represent DFlash running in a degraded mode compared to what CUDA users see. The 3-5x speedups in the project’s benchmarks were measured on an RTX 3090 with the GPU ring path.

#Benchmark Results

All tests used the same harness: curl against the llama-server OpenAI-compatible API, stream: false, temperature: 0.0, 1 warmup request, 3 measured runs per test. Two prompts: a short coding task (512 generated tokens) and a longer library design task (256 generated tokens).

Common flags: -fa on -ngl 999 -b 512 -ub 256 --no-mmap --ctx-size 16384 --parallel 1 --cache-type-k q8_0 --cache-type-v q8_0

#Gemma 4 31B (Dense)

Target: gemma-4-31B-it-UD-Q8_K_XL.gguf (33 GB). DFlash drafter: gemma4-31b-it-dflash-IQ4_XS.gguf (~836 MB).

Phase	Test	TG
Baseline (stock llama.cpp)	short-512	6.19 t/s
Bee DFlash	short-512	16.51 t/s
Baseline (stock llama.cpp)	long-256	6.21 t/s
Bee DFlash	long-256	12.46 t/s

Speedup: 2.67x on short, 2.01x on long.

#Qwen 3.6 27B (Dense)

Target: Qwen3.6-27B-UD-Q8_K_XL.gguf (33 GB). DFlash drafter: Qwen3.6-27B-DFlash-Q4_K_M.gguf (~1 GB).

Phase	Test	TG
Baseline (stock llama.cpp)	short-512	6.46 t/s
Bee DFlash	short-512	16.03 t/s
Baseline (stock llama.cpp)	long-256	6.48 t/s
Bee DFlash	long-256	12.03 t/s

Speedup: 2.48x on short, 1.86x on long.

Both dense models go from ~6 t/s baseline to 12-16.5 t/s with DFlash. That’s a real, consistent improvement on the bandwidth wall that dominates Strix Halo inference.

#DFlash vs MTP: The Head-to-Head

The interesting question isn’t “is DFlash faster than baseline?” (it is), it’s “is DFlash faster than what I’m already running?” For Qwen 3.6, I ran a strict comparison using the same harness, same prompts, same flags, same port, swapping only the server binary and model between phases.

#Qwen 3.6 27B, Three-Way Comparison

Method	Short-512 TG	Long-256 TG	vs Baseline
Baseline (stock)	6.46 t/s	6.48 t/s	-
Bee DFlash (Q8 target + DFlash drafter)	16.07 t/s	12.05 t/s	2.5x / 1.9x
MTP Q8 (Q8_0-mtp, draft-mtp n-max 5)	20.85 t/s	16.73 t/s	3.2x / 2.6x
MTP Q4 (Q4_K_M-mtp, draft-mtp n-max 5)	19.82 t/s	20.18 t/s	3.1x / 3.1x

MTP vs Bee DFlash:

Test	MTP Q8 vs Bee	MTP Q4 vs Bee
short-512	+29.7% faster	+23.4% faster
long-256	+38.8% faster	+67.5% faster

MTP wins across the board, with the gap widening on longer generation runs.

#Gemma 4 31B Comparison (Historical)

For Gemma 4, I don’t have the atomic fork binary currently built to do a strict same-session comparison, but my earlier MTP assistant benchmarks measured 22.9 t/s at block-size 5. Today’s Bee DFlash measured 16.5 t/s on a similar prompt. That’s roughly a 28% MTP advantage, consistent with the Qwen results.

#Updated Speculative Decoding Leaderboard

Adding DFlash to the running tally of every speculative method I’ve benchmarked on Strix Halo:

#Qwen 3.6 27B (Dense, ~27B)

Method	TG Speed	vs Baseline
MTP Q4_K_M (n-max 5)	~20 t/s	~3.1x
MTP Q8_0 (n-max 5)	~21 t/s	~3.2x
Bee DFlash (Q8 target)	~16 t/s	~2.5x
Cross-gen draft (Qwen 3.5 2B)	6.19 t/s	0%
ngram-mod	6.18 t/s	0%
Baseline	6.46 t/s	-

#Gemma 4 31B (Dense, ~31B)

Method	TG Speed	vs Baseline
MTP assistant (bs=5)	22.9 t/s	3.7x
Bee DFlash (Q8 target)	16.5 t/s	2.7x
E2B draft model (dm=4)	12.3 t/s	2.0x
Baseline	6.2 t/s	-

#Why MTP Wins on This Hardware

The result makes sense when you think about how each method uses the memory bus.

MTP (weight-baked) has the prediction heads inside the model GGUF. When the model streams its weights for a forward pass, the MTP heads come along for free, they’re part of the same sequential read. There’s zero additional memory bus contention from a second model. On Q4_K_M (15.4 GB), each forward pass streams less data and verifies multiple tokens, which is why Q4 MTP hits ~20 t/s on a ~6 t/s baseline.

DFlash runs a separate draft model (~1 GB) that cross-attends to the target’s hidden states. Even though 1 GB is small, the CPU ring buffer path on Vulkan adds latency to every draft-verify cycle. On CUDA, this overhead is hidden by a GPU-side ring buffer, but Vulkan doesn’t have that path. The result is that DFlash pays a per-cycle tax that MTP avoids entirely.

The MTP advantage would likely shrink on CUDA hardware where DFlash’s GPU ring is active. The BeeLlama benchmarks on an RTX 3090 show 4-5x speedups (vs 2-2.7x on Vulkan here), which would put DFlash much closer to MTP territory.

#What DFlash Does Better

DFlash has some practical advantages over MTP that the raw throughput numbers don’t capture:

Standard target GGUFs: DFlash works with any unmodified model GGUF. MTP requires either specially converted MTP GGUFs (Qwen) or fork-specific assistant GGUFs (Gemma). If a new model drops tomorrow, DFlash drafters can be created without modifying the target model at all.
Adaptive draft depth: BeeLlama’s profit controller adjusts draft depth at runtime based on acceptance rates. MTP uses a fixed --spec-draft-n-max.
Reasoning-loop protection: The fork detects repeated hidden reasoning output and intervenes, which is genuinely useful for long agentic tasks.
Single fork: Both Qwen and Gemma DFlash run through the same binary. My MTP setup currently requires two different forks (am17an for Qwen, atomic for Gemma).

#Limitations

Vulkan is the wrong backend for DFlash. The CPU ring fallback on Vulkan leaves significant performance on the table. On a CUDA system the gap vs MTP would be much smaller or possibly reversed.
--parallel 1 required: Same as MTP, single-slot only.
No TurboQuant on Vulkan: The fork’s advanced KV cache compression (turbo3_tcq, etc.) is CUDA-only. Standard cache types work fine.
Draft model ecosystem: DFlash drafters are currently published for Qwen 3.6 27B and Gemma 4 31B. Coverage will depend on the project maintaining drafter GGUFs for new model families.

#Bottom Line

BeeLlama DFlash is a solid 2-2.7x speedup over baseline on Strix Halo’s Vulkan path. If you don’t have MTP set up, or if you’re on a model family that doesn’t have MTP GGUFs, DFlash is the easier path to faster dense model inference, one fork, one draft GGUF download, two extra flags.

But if you already have MTP working (as I do for both Qwen and Gemma), there’s no reason to switch. MTP is 23-67% faster in practice on this hardware, and the gap comes from a fundamental architectural advantage: MTP heads ride along with the target model’s weight reads instead of competing for the memory bus through a separate model and a CPU ring buffer.

I’ll revisit this if BeeLlama adds native ROCm ring support or if I get access to a CUDA system where DFlash can use its GPU ring path. The project is actively developed and the DFlash approach has clear potential, it just needs the right backend to show its full strength.