Testing llama.cpp PR #21344: Faster MoE Prefill, but MTP Fights Back
A community PR that optimizes CUDA kernels for GFX1151 delivers 24% higher prefill throughput on MoE models, but combining the same kernel changes with MTP (multi-token prediction) speculative decoding makes inference slower. Not every optimization stacks.