lemmy.ml

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

submitted by yogthos to technology | 4 points | 0 comments

GGUF port of DFlash speculative decoding: a standalone C++/CUDA stack built on ggml that runs on a single 24 GB RTX 3090 and hosts the new Qwen3.6-27B. It reaches a mean speedup of ~1.98x over autoregressive decoding on Qwen3.6 across HumanEval, GSM8K, and Math500, with zero retraining. If you have CUDA 12+ and an NVIDIA GPU such as an RTX 3090, 4090, or 5090, all you need to do is clone the repo and run:

    cd lucebox-hub/dflash
    cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
    cmake --build bui...
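The post does not describe how DFlash drafts or verifies tokens. As a generic illustration of why speculative decoding can raise throughput without changing the target model, here is a minimal sketch of the standard greedy verification step. It is not DFlash's algorithm, and the names `Token` and `verify_greedy` are hypothetical. It assumes a draft model has proposed k tokens and that one batched forward pass of the target model has produced its greedy (argmax) choice at each of the k + 1 positions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical token id type for this sketch.
using Token = std::int32_t;

// Generic greedy speculative-decoding verification (not DFlash-specific).
// draft:         k tokens proposed by a cheap draft model.
// target_argmax: k + 1 tokens; target_argmax[i] is the target model's greedy
//                choice at position i, all computed in ONE forward pass over
//                the context plus the k draft tokens.
// Returns the tokens to append: the longest prefix of the draft that agrees
// with the target, followed by exactly one token taken from the target.
std::vector<Token> verify_greedy(const std::vector<Token>& draft,
                                 const std::vector<Token>& target_argmax) {
    assert(target_argmax.size() == draft.size() + 1);
    std::vector<Token> accepted;
    std::size_t i = 0;
    // Accept draft tokens while they match what the target would have chosen.
    while (i < draft.size() && draft[i] == target_argmax[i]) {
        accepted.push_back(draft[i]);
        ++i;
    }
    // At the first mismatch, or after all k draft tokens matched, the target's
    // own prediction is kept, so every verification pass emits >= 1 token.
    accepted.push_back(target_argmax[i]);
    return accepted;
}
```

For example, with a draft of {5, 7, 9} and target predictions {5, 7, 2, 4}, the first two draft tokens are accepted and the target's 2 replaces the rejected 9, giving {5, 7, 2}: three tokens for one target pass. Because the output always equals what greedy decoding of the target alone would produce, speedups like the ~2x reported here come from emitting several tokens per expensive target pass, minus the cost of running the drafter.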

