🗄️ KV Cache - ghosh.debasish · Scour

KV Cache Optimization: 3x Faster LLM Inference on 24GB VRAM 🧠LLMs

tildalice.io·6d

Understanding KV Cache: The Hidden Memory Cost of Serving LLMs 🧮Cache-Oblivious Algorithms

melchi.me·1d·Hacker News

LLM Inference 🔤PLT

iop.systems·2h

KV Cache and Flash Attention with interactive diagrams 🧮Cache-Oblivious Algorithms

kvcache.cobanov.dev·10h·Hacker News

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips 🖥️Systems Programming

supercomputing-system-ai-lab.github.io·2d·Hacker News

InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents 🧠Reasoning Models

inferencebench.ai·6h·Hacker News

The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs 🧠Reasoning Models

cloudnativenow.com·5d

KV Cache Is Becoming the Memory Hierarchy of Inference 🧠Reasoning Models

touchdown-labs.com·2d

GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU 🖥️Systems Programming

theahmadosman.substack.com·8h·Substack, r/LocalLLaMA

Ollama Doesn't Know Its GPU Is on Another Machine 🦎Zig Allocators

loopholelabs.io·15h·Hacker News

2.3x KV Cache Compression at 32k Context 🛢️Database Internals

github.com·6d·Hacker News

Building a Controllable Inference Platform on Kubernetes with AI Runway 🧠Reasoning Models

techcommunity.microsoft.com·2d

Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints 🧠Reasoning Models

aws.amazon.com·5h

Eliminate LLM Cold starts: Load models up to 6x Faster with Azure Blob Storage and Run:AI Model Streamer 💾Storage Engines

devblogs.microsoft.com·1d

I built a catalog of portable AI capability packs for coding agents. Is this useful or too abstract? 📊LLM Evaluation

doramagic.ai·17h·r/SideProject

Let AI Agents Write Your Serving Stack with VibeServe 🧠Reasoning Models

syfi.cs.washington.edu·6d·Hacker News

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization 🧮Cache-Oblivious Algorithms

I replaced GitHub Copilot with a self-hosted AI and I won’t go back ⚡Zig

xda-developers.com·10h

AMD says its $4K Ryzen AI Halo workstation practically pays for itself 🦎Zig Allocators

theregister.com·5h

LLM Observability with Self-Hosted Langfuse and vLLM 📐Linearizability

pyimagesearch.com·2d
