⚙️Systems Programming thecomputersciencebook.com

PagedAttention is more than virtual memory

Covers "Efficient Memory Management for Large Language Model Serving with PagedAttention". Discussed on Hacker News.

Everyone repeats the analogy that PagedAttention is virtual memory for the KV cache. But I think the more interesting story is what comes along with it: copy-on-write, swapping, thrashing, a missing MMU, a timing side-channel that leaks other users' prompts, and new ways to optimise serving LLMs.
