The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
You quantized the model and it fits — then it runs out of memory at long context. The culprit is the KV cache, and at 128k tokens it can dwarf the model itself. Here's the math, the fix, and what it means for buying local-LLM hardware.
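As a rough illustration of the math, the cache stores one key and one value vector per token, per layer, per KV head, so its size grows linearly with context length. The sketch below is a back-of-the-envelope estimate, not a measurement: the function name `kv_cache_bytes` is hypothetical, and the shape used (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumed configuration typical of an 8B-class model with grouped-query attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16/bf16 cache (an assumption; a
    quantized cache would use a smaller value).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


GIB = 1024 ** 3

# Assumed 8B-class shape with grouped-query attention, 128k-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128 * 1024)
print(f"{size / GIB:.1f} GiB")  # prints "16.0 GiB"
```

Under these assumptions the cache alone needs 16 GiB at 128k tokens, while 8 billion weights at 4 bits each come to roughly 4 GB, which is how a model that fits comfortably at short context can still run out of memory at long context.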