Prompt processing vs. generation: two phases, opposite bottlenecks

Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on Hacker News

The same machine can rip through token generation yet crawl when it reads a long prompt, or the reverse. That's because local LLMs run in two phases with opposite bottlenecks. Understand them and you'll know exactly which hardware spec to prioritize when you buy.
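To make the two bottlenecks concrete, here is a rough back-of-envelope sketch. It assumes the common rule of thumb that prompt processing is limited by compute (roughly 2 FLOPs per model parameter per prompt token) while generation is limited by memory bandwidth (the weights are read from memory about once per generated token). The function names and the hardware figures are hypothetical, chosen only for illustration; the sketch ignores the KV cache, batching, and real-world efficiency losses, so treat the results as upper bounds rather than predictions.

```python
def prefill_seconds(params_billion: float, prompt_tokens: int, tflops: float) -> float:
    """Estimate prompt-processing time, assuming it is compute-bound.

    Assumption: about 2 FLOPs per parameter per prompt token.
    """
    total_flops = 2 * params_billion * 1e9 * prompt_tokens
    return total_flops / (tflops * 1e12)


def decode_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Estimate generation speed, assuming it is memory-bandwidth-bound.

    Assumption: every generated token requires reading all weights once.
    """
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Hypothetical machine: 10 TFLOPS of usable compute, 400 GB/s memory bandwidth.
    # Hypothetical model: 7B parameters, about 4 GB after quantization.
    print(f"prefill of 1000 tokens: {prefill_seconds(7, 1000, 10):.2f} s")
    print(f"generation: {decode_tokens_per_second(4, 400):.0f} tokens/s")
```

Under these assumptions, doubling compute halves the prompt-processing estimate without changing generation speed, and doubling memory bandwidth does the opposite, which is why a machine can be fast in one phase and slow in the other.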
