KV Cache


The Sequence AI of the Week #875: Why Your Language Model Needs a Nap

 💰Compute Costs  Content type: News  Content type: Blog

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

 🖥️Inference Engineering

Latest technical articles & videos.

 🎯Fine-tuning
certdepot.net

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

 🖥️Inference Engineering  Content type: News  Content type: Blog
blog.google · Hacker News

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 🖥️Inference Engineering  Content type: Blog
jimmysong.io

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo

 🖥️Inference Engineering  Content type: News
latent.space

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 🖥️Inference Engineering

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

 🖥️Inference Engineering  Content type: Academic
arxiv.org

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 🖥️Inference Engineering  Content type: Code
github.com · Hacker News

Machinic Psychopharmacology: Do LLMs Self-Medicate?

 🖥️Inference Engineering

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 🖥️Inference Engineering  Content type: Blog
tilert.ai · Hacker News

Show HN: Taliesin – bit-exact KV-cache restore, 21x faster, cross-GPU verified

 🖥️Inference Engineering  Content type: Blog
medium.com · Hacker News

WEKA software speeds long context AI inferencing on Oracle’s public cloud

 🖥️Inference Engineering  Content type: News
blocksandfiles.com

Where to Host Your Open-Source Model (Under 10B Parameters)

 🖥️Inference Engineering
digitalocean.com

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🖥️Inference Engineering  Content type: Blog
dnhkng.github.io

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

 🖥️Inference Engineering  Content type: Code
github.com · r/LocalLLaMA

Google's new open model DiffusionGemma generates text from noise instead of word by word

 🖥️Inference Engineering
the-decoder.com

Issue #390 - The ML Engineer 🤖

 🖥️Inference Engineering  Content type: News  Content type: Blog

Massive AI Storage Demand Creates a New Memory Wall

 🖥️Inference Engineering  Content type: News
eetimes.com

Anatomy of a high-performance EP kernel

 🖥️Inference Engineering  Content type: Blog
