The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories, including "Efficient Memory Management for Large Language Model Serving with PagedAttention". Discussed on Hacker News.

You quantized the model and it fits, but then it runs out of memory at long context. The culprit is the KV cache, and at 128k tokens it can dwarf the model itself. Here's the math, the fix, and what it means for buying local-LLM hardware.
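
To make the 128k claim concrete, here is a rough sketch of the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one head_dim-sized vector per KV head per token. The model shape below (32 layers, 8 KV heads, head dimension 128, about 8 billion parameters) is an assumed example of a typical grouped-query-attention model, not a figure taken from the article, and the function name is hypothetical.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts keys and values separately.
    bytes_per_elem=2 corresponds to fp16/bf16 cache entries.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Assumed example shape (not from the article).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

cache = kv_cache_bytes(seq_len=128 * 1024, **cfg)

# Rough weight footprint for an assumed 8e9-parameter model at 4 bits
# (0.5 bytes) per parameter, ignoring quantization overhead.
weights = int(8e9 * 0.5)

print(f"KV cache at 128k tokens: {cache / GIB:.1f} GiB")  # 16.0 GiB
print(f"4-bit weights:           {weights / GIB:.1f} GiB")  # 3.7 GiB
```

Under these assumptions the fp16 cache for a single 128k-token sequence is roughly four times the size of the quantized weights. Passing bytes_per_elem=1 models an 8-bit cache and halves the estimate; real runtimes add their own overhead, so treat the numbers as a lower bound.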
