Testing llama.cpp PR #21344: Faster MoE Prefill, but MTP Fights Back
A community PR optimizing CUDA kernels for GFX1151 delivers +24% prefill throughput on MoE models, but combining those same kernel changes with MTP (multi-token prediction) speculative decoding makes inference slower. Not every optimization stacks.