Inference Engineering

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

 💾KV Cache  Content type: Academic
arxiv.org

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

 ⏱️Prefill Decoding  Content type: Code
github.com

Fixing a stuck Ollama runner and building a GPU watchdog

 🧵Warp Scheduling

How we fight GPU scarcity without compromise

 💾KV Cache  Content type: Blog
equixly.com · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

 💾KV Cache  Content type: Blog
dnhkng.github.io

DiffusionGemma: 4x Faster Text Generation

 🎮GPU Computing  Content type: News, Blog

Using Scikit-LLM with Open-Source LLMs

 ⚙️ML Compilers

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

 🗜️Quantization
alternativeto.net

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

 🎮GPU Computing
phoronix.com

A system programmer’s guide to LLM inference

 💰Inference Cost  Content type: Blog

Speculators v0.5.0: DFlash support and online training

 🚀Speculative Decoding
developers.redhat.com

High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk

 🧠HBM Bandwidth
ncnonline.net

Tales of an Ollama Honeypot (Part 3): More Traffic, More Findings

 🔭Observability
posts.inthecyber.com

NVIDIA Nemotron 3 Ultra

 ⚗️Kernel Fusion  Content type: Blog
ollama.com

Neo-X7/Neo-AI: A fully offline AI assistant powered by Ollama. Stores and retrieves conversations using SQLite + LanceDB vector search. No cloud. No API keys. Runs entirely on your machine.

 ⚙️MLOps  Content type: Code
github.com · DEV

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure

 💰Inference Cost  Content type: Blog
jimmysong.io

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

 💰Inference Cost  Content type: Academic
arxiv.org

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 💰Inference Cost

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design

 💰Inference Cost  Content type: Blog
tilert.ai · Hacker News

MLPerf and the rise of latency-aware LLM benchmarking

 ⏱️Prefill Decoding
edn.com
