💾 KV Cache - nayyara.airlangga · Scour

heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.

⏱️Prefill Decoding Code

github.com··r/LocalLLaMA

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

🧠Inference Engineering Academic

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

🧠Inference Engineering

huggingface.co··r/LocalLLaMA

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

🧠Inference Engineering

zozo123.github.io··Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🧠Inference Engineering News

newsletter.semianalysis.com··Hacker News

How we fight GPU scarcity without compromise

💰Inference Cost Blog

equixly.com··Hacker News

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🎮GPU Computing Blog

blogs.nvidia.com

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🧠Inference Engineering Blog

dnhkng.github.io

DiffusionGemma: The Developer Guide

🧠Inference Engineering Blog

developers.googleblog.com

Speculators v0.5.0: DFlash support and online training

🚀Speculative Decoding

developers.redhat.com

Report: GKE Inference Gateway delivers up to 92% faster AI responses

🧠Inference Engineering Blog

cloud.google.com··Hacker News

Less-relevant results

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🎮GPU Computing

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

⏱️Prefill Decoding Blog

Making Local LLM Go Brrr

⏱️Prefill Decoding

seanpedersen.github.io

Breaking the Ice: Analyzing Cold Start Latency in vLLM

🧠Inference Engineering Academic

arxiv.org··Hacker News

harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.

🧠Inference Engineering Code

github.com··Hacker News

Machinic Psychopharmacology: Do LLMs Self-Medicate?

🧠Inference Engineering

lesswrong.com··Hacker News

Running LLM Inference on Kubernetes: What It Actually Takes

🧠Inference Engineering Blog

fairwinds.com

Token4Token — pay-per-token inference on Gnosis + Swarm

🧠Inference Engineering

t4t.eth.link··Hacker News

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

🧠Inference Engineering Video
