Sparse KV Caches Cut Attention Scaling (opens in new tab)
Sparse key‑value caches collapse the quadratic blow‑up of softmax attention into a cost that grows near‑linearly with sequence length. By making each query attend to a tiny, top‑k subset of blockwise KV memories, the per‑query work stops scaling with the full context. This tiny change flips the scalability curve for ultra‑long sequences and makes multi‑hundred‑kilobyte windows practical on a single GPU. Before this work, the dominant recipe was dense attention, whose (O(N^{2})) memory and FLO...
Read the original article