Benchmarking llama.cpp's brand-new MTP support on Strix Halo

Covers 2 stories, including "llama + spec: MTP Support" by am17an (Pull Request #22673). Discussed on Hacker News.

After llama.cpp merged Multi-Token Prediction (MTP) speculative decoding support, I benchmarked Qwen3.6 27B and 35B-A3B on Strix Halo and an RTX 3090. The result: up to a 2.44× speedup with lossless output. Build-from-source steps are included.
