AI/ML Research Digest
Extreme KV‑Cache Compression and Long‑Context Efficiency

Static quantization is giving way to rotation‑based and context‑sensitive schemes. OCTOPUS and OScaR reach near‑lossless INT2 performance while cutting cache size dramatically [1], [2]. Sparse token indexers replace dense caches with a searchable sketch, preserving attention fidelity at lower memory cost [3]. Linear‑attention decoupling splits the KV stream into a short‑term mutable part and a long‑term static part, keeping long‑context...
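To make the idea of rotation-based quantization concrete, here is a minimal sketch of the general principle, not the OCTOPUS or OScaR method (the digest does not describe their algorithms). It assumes a plain Walsh-Hadamard rotation and a uniform 2-bit quantizer with one scale and offset per vector; the function names and the sample vector are hypothetical. The rotation spreads a single outlier channel across all coordinates, so the four INT2 levels cover a much narrower range.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal Walsh-Hadamard transform; the length must be a power of two.

    The normalized transform is its own inverse, so the same function
    is used to rotate and to rotate back.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_int2(vec):
    """Uniform asymmetric 2-bit quantization: codes in {0, 1, 2, 3}."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant vector
    codes = [min(3, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    """Map 2-bit codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def roundtrip(vec, rotate):
    """Quantize and dequantize, optionally inside a Hadamard rotation."""
    x = hadamard_rotate(vec) if rotate else list(vec)
    codes, scale, lo = quantize_int2(x)
    y = dequantize_int2(codes, scale, lo)
    return hadamard_rotate(y) if rotate else y


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical key vector with one large outlier channel.
    key = [8.0, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1]
    print("INT2 error without rotation:", mse(key, roundtrip(key, rotate=False)))
    print("INT2 error with rotation:   ", mse(key, roundtrip(key, rotate=True)))
```

On this outlier-heavy example the rotated round trip gives a lower reconstruction error than the direct one. That is not guaranteed for every input: rotation helps most when a few channels dominate the value range, and real systems typically quantize per group or per channel rather than per vector.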