ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory
Modern large language models based on softmax scaled-dot-product attention are constrained by their training sequence length: as the key-value sequence grows, softmax probability mass can dilute across a wider distribution, inducing activation shift and long-context performance collapse. Moreover, long-context language modeling faces a structural tension: a sliding-window attention core maintains a bounded local representation and low perplexi...
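The dilution effect described above can be illustrated with a small numerical sketch. This is not the ATMA method; it only shows how standard softmax behaves as the number of keys grows. The function names, the fixed logit of 4.0 for the single relevant key, and the standard-normal logits for the distractor keys are all illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def peak_attention_weight(n_keys, relevant_score=4.0, seed=0):
    """Softmax weight on one 'relevant' key among n_keys - 1 distractors.

    Assumption for illustration: the relevant key keeps a fixed logit,
    while distractor logits are drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    scores = [relevant_score] + [rng.gauss(0.0, 1.0) for _ in range(n_keys - 1)]
    return softmax(scores)[0]


if __name__ == "__main__":
    # The relevant key's logit never changes, yet its share of the
    # probability mass shrinks as more keys enter the denominator.
    for n in (128, 1024, 8192):
        print(n, round(peak_attention_weight(n), 4))
```

Under these assumptions the weight on the relevant key falls as the context lengthens, because every extra key adds a positive term to the softmax denominator while the numerator stays fixed.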