IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Covered by 3 sources including huggingface.co, GitHub

arXiv:2603.12201v1. Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. Howe...
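
To make the top-k selection step in the abstract concrete, here is a minimal sketch in plain Python. It is not the DSA or IndexCache implementation: the function names are hypothetical, the indexer scores are passed in as a ready-made matrix (`index_scores`) rather than computed by a real lightning indexer, and a single attention head with causal masking is assumed. It only illustrates how restricting each query to k selected tokens brings the core attention work down from roughly $L^2$ to roughly $Lk$ dot products.

```python
import math


def topk_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(queries, keys, values, index_scores, k):
    """Single-head causal attention restricted to top-k tokens per query.

    index_scores[t][j] is an assumed, precomputed relevance score of token j
    for query t (a stand-in for whatever the indexer would produce).
    Returns (outputs, selected) where selected[t] lists the attended positions.
    """
    outputs, selected = [], []
    for t, q in enumerate(queries):
        causal = range(t + 1)  # a query may only see itself and earlier tokens
        picked = topk_indices([index_scores[t][j] for j in causal], min(k, t + 1))
        scale = 1.0 / math.sqrt(len(q))
        logits = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in picked]
        weights = softmax(logits)
        dim = len(values[0])
        out = [sum(w * values[j][d] for w, j in zip(weights, picked)) for d in range(dim)]
        outputs.append(out)
        selected.append(picked)
    return outputs, selected


def core_attention_cost(seq_len, k):
    """Query-key dot products needed by the core attention with top-k selection."""
    return sum(min(k, t + 1) for t in range(seq_len))


def dense_attention_cost(seq_len):
    """Query-key dot products needed by dense causal attention."""
    return seq_len * (seq_len + 1) // 2
```

In this sketch the cost helpers count only the core attention step; the cost of producing `index_scores` is deliberately left out, since the sketch treats those scores as given.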

Covered in 4 articles

huggingface.co·

GLM-5.2: Built for Long-Horizon Tasks

Discussed on Hacker News and r/LocalLLaMA

huggingface.co·

zai-org/GLM-5.2 is here!

Discussed on Hacker News and r/LocalLLaMA

GitHub·

zai-org/GLM-5
