Inference

KV cache quantization: what FP8/INT8 K and V actually buy you, and where they break

 Quantization  Content type: Blog
dev.to··DEV

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

 Quantization  Content type: Blog

Report: GKE Inference Gateway delivers up to 92% faster AI responses

 ☁️GCP  Content type: Blog

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

 🤖AI Inference  Content type: Code
github.com··Hacker News

Speculators v0.5.0: DFlash support and online training

 🔓Open Source AI
developers.redhat.com·

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 🔓Open Source AI  Content type: News  Content type: Blog
blog.google··Hacker News

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the Answer

 🤖AI Inference  Content type: Blog
dev.to··DEV

3x Faster Search: Parallel Test-Time Scaling with Instructed-Retriever-1

 📊Benchmarking  Content type: Blog
databricks.com·

The Prefill Wall: Why MTP's 2× Barely Moves Long-Context Latency (Qwen3.6-27B, RTX 3090)

 🤖Large Language Models  Content type: Blog
dev.to··DEV

The Death of the Four Golden Signals: Designing Telemetry for Non-Deterministic Infrastructure

 🔧SRE
devops.com·

KVarN, Cost.dev, headroom — the week the agent runtime bill got itemized

 🤖AI Inference  Content type: Blog
dev.to··DEV

The 4-layer voice-agent latency stack, traced with OTel spans

 🔭OpenTelemetry  Content type: Blog
dev.to··DEV
Less-relevant results

Doubling Qwen3.6-27B on One RTX 3090: ollama → llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

 🖥️Local AI  Content type: Blog
dev.to··DEV

Why Self-Hosted Claude Code Was 15× Slower Than It Should Be

 🧠LLMs  Content type: Blog
dev.to··DEV

Why most LLM VRAM calculators are wrong on modern models (and an open-source MIT fix)

 🔓Open Source AI  Content type: Blog
dev.to··DEV

I Benchmarked 3 Local LLMs on My Laptop — Here's What the Numbers Actually Show

 🧠LLM  Content type: Blog
dev.to··DEV

NVIDIA and Apple Solved the Hardware. Here's What's Left to Build.

 Quantization  Content type: Blog
dev.to··DEV

Why TPUs Aren't Popular (Even Though They're Cheaper Per Token)

 🖥️GPU  Content type: Blog
dev.to··DEV

NVIDIA Showed an Agent Building Architecture on a Laptop

 🏢Architecture  Content type: Blog
dev.to··DEV

I Connected PewDiePie's Odysseus to a Cloud Memory Stack — Zero API Costs, Persistent Memory

 🧠LLM Tooling  Content type: Blog
dev.to··DEV

