What: The FlashMemory-DeepSeek-V4 paper introduces Lookahead Sparse Attention (LSA), a way to decode very long contexts without loading the whole KV cache: a small Neural Memory Indexer is trained to predict which chunks of the cached past a token will actually use. Why: At long context the binding cost is memory, not math. The KV cache grows with every token until it dominates GPU serving memory; LSA cuts the physical cache footprint to 13.5% of the full version while nudging accuracy up by 0.6%. ...
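
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how a KV cache grows with context length and what a 13.5% footprint would mean. It assumes a plain cache that stores one key and one value vector per token, per layer, per KV head; the paper's model may lay out its cache differently. The layer count, head count, head dimension, and the helper names `kv_cache_bytes` and `footprint_report` are hypothetical, chosen only for illustration. Only the 13.5% figure comes from the summary above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for a plain per-token KV cache.

    The leading factor 2 counts one key vector plus one value vector.
    bytes_per_elem=2 assumes 16-bit storage (an assumption, not from the paper).
    """
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only (not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Physical footprint relative to the full cache, as quoted in the summary.
RETAINED_FRACTION = 0.135

GIB = 1024 ** 3


def footprint_report(num_tokens):
    """Return (full cache size, reduced cache size) in GiB for one sequence."""
    full = kv_cache_bytes(num_tokens, LAYERS, KV_HEADS, HEAD_DIM)
    reduced = full * RETAINED_FRACTION
    return full / GIB, reduced / GIB


if __name__ == "__main__":
    for tokens in (8_192, 131_072, 1_048_576):
        full_gib, reduced_gib = footprint_report(tokens)
        print(f"{tokens:>9} tokens: full {full_gib:8.2f} GiB, "
              f"at 13.5% {reduced_gib:8.2f} GiB")
```

Under these assumed dimensions each token costs 128 KiB of cache, so the full cache grows linearly from 1 GiB at 8,192 tokens to 128 GiB at about a million tokens. The sketch only models footprint; it says nothing about how the indexer chooses chunks.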
