Testing llama.cpp PR #21344: Faster MoE Prefill, but MTP Fights Back

Covers Model swapping with vLLM

A community PR optimizing CUDA kernels for GFX1151 delivers +24% prefill throughput on MoE models, but combining those same kernel changes with MTP speculative decoding makes inference slower. Not every optimization stacks.
