Speculative Decoding

Google's new open model DiffusionGemma generates text from noise instead of word by word

 🎮GPU Computing
the-decoder.com

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

 🔢GEMM Optimization  Content type: Academic
arxiv.org

Jason McDonald

 FlashAttention

Youssof Altoukhi (@Youssofal_)

 🧠Inference Engineering
xcancel.com · r/LocalLLaMA

Amy Adams Brings Dario Vitale’s Versace Style to ‘The Tonight Show’

 FlashAttention  Content type: News
wwd.com

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🗜️Quantization

mirkolenz/llmhop: Tiny, stateless Go router that dispatches OpenAI-compatible requests to single-model vLLM and sglang backends with zero external dependencies

 💾KV Cache  Content type: Code
github.com · Hacker News

a local Windows app for interview prep and mock practice

 🚀Model Serving
ofarwise.com · Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

 💰Inference Cost

Build a Medical Report Analyzer on Dedicated Inference with Python

 🧠Inference Engineering
digitalocean.com

Castlevania: Belmont's Curse release date confirmed on October 15, Japanese voice cast list also revealed

 🧵Warp Scheduling
rpgsite.net

Review: The Boy with the Light-Blue Eyes - SXSW London 2026

 Triton
cineuropa.org

B & S About Movies podcast Episode 140: The Sons of Hercules

 🔢GEMM Optimization
bandsaboutmovies.com

New rumour claims with '100%' confidence that AMD's next-gen Zen 6 desktop CPU will run at over 6.5 GHz

 🧠HBM Bandwidth  Content type: News
pcgamer.com

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 💾KV Cache  Content type: Code
github.com · r/LocalLLaMA

the sissy boy

 🚀Model Serving  Content type: Blog
blog.hyeonje.website

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 💰Inference Cost  Content type: News  Content type: Blog
blog.google · Hacker News

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

 🧠Inference Engineering  Content type: Academic
arxiv.org

Machinic Psychopharmacology: Do LLMs Self-Medicate?

 💾KV Cache
lesswrong.com · Hacker News

Barbara Gladstone Living Room

 🧠HBM Bandwidth
greg.org