PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present Poly...
