DEV Community

Doubling Qwen3.6-27B on One RTX 3090: Ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

A reader on my last post said Ollama was leaving a lot on the table: a tuned backend with multi-token prediction (MTP) could roughly double my 3090's throughput. So I went and measured it, one lever at a time. The short version: they were right, the 2.25× is real, and below is the exact path that got me there on my box.

TL;DR: On a single RTX 3090, Qwen3.6-27B generation went from 35.7 tok/s (Ollama) to 80.2 tok/s (llama.cpp + MTP), a measured 2.25×, by stacking three independent lever...
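As a quick sanity check on the headline number, the arithmetic behind the 2.25× can be sketched in a few lines. The two throughput figures come from the post; the helper names `speedup` and `ms_per_token` are hypothetical, chosen here only for illustration.

```python
# Sanity-check the headline speedup from the two measured throughputs.
# The tok/s figures are the ones quoted in the post; the helper names
# are illustrative and not part of any tool mentioned there.

OLLAMA_TOK_S = 35.7        # baseline generation speed (Ollama)
LLAMACPP_MTP_TOK_S = 80.2  # tuned generation speed (llama.cpp + MTP)


def speedup(baseline_tok_s: float, tuned_tok_s: float) -> float:
    """Return the throughput ratio tuned / baseline."""
    return tuned_tok_s / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Convert tokens per second into average milliseconds per token."""
    return 1000.0 / tok_s


if __name__ == "__main__":
    ratio = speedup(OLLAMA_TOK_S, LLAMACPP_MTP_TOK_S)
    print(f"speedup: {ratio:.2f}x")
    print(f"baseline latency: {ms_per_token(OLLAMA_TOK_S):.2f} ms/token")
    print(f"tuned latency:    {ms_per_token(LLAMACPP_MTP_TOK_S):.2f} ms/token")
```

Viewed as latency, the same result means the average time per generated token drops to a bit under half of the baseline.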

