llama + spec: MTP Support by am17an · Pull Request #22673

github.com · Covered in 10 articles from 8 sources

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 to 80.2 tok/s)

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

bric.pe.kr · DEV

This Month in Agentic Coding – May 2026

agenticcodingweekly.com · Hacker News

Multi Token Prediction in llama.cpp

am17an.bearblog.dev

AI's Plummeting Prices Are a Software Story, Not a Hardware One

weightythoughts.com · Hacker News

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

Benchmarking llama.cpp's brand-new MTP support on Strix Halo

calebcoffie.com · Hacker News

froggeric/Qwen3.6-27B-MTP-GGUF

huggingface.co · DEV

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

huggingface.co · r/LocalLLaMA

In other languages

Qwen3.6 MTP is 0.3 GB larger but gives a ~2x speedup: from 60 t/s to 130 t/s for Qwen3.6 27B without distortion (original in Russian)