By Yujun (Lucas) Qian, in collaboration with William Baisi, KangHyuk Lee, and Anshul Sadh-Gauri (Columbia University). Mentored by Dr. Garrett Goon.


The “Straggler” Problem in Large-Scale Inference

As Large Language Models (LLMs) continue to grow in context length (now routinely handling 128k or even 1M tokens), serving them efficiently has become a major systems challenge. While techniques like FlashAttention and Ring Attention have revolutionized memory management by splitting the Key-Value (KV) cache across multiple GPUs, they have a hidden limitation: they implicitly assume hardware homogeneity.

In a perfect world, every GPU in a cluster is identical. In the real world, data centers evolve. Partial upgrades, cost co…
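To make the straggler problem concrete, here is a minimal toy sketch (not from the post; the throughput numbers are made up) of why an even KV-cache split penalizes a heterogeneous cluster: with ring-style attention, each step waits for the slowest GPU, so a single older card sets the effective latency unless shard sizes account for hardware speed.

```python
# Illustrative sketch: per-step latency when each GPU processes its
# KV-cache shard in parallel. The step completes only when the slowest
# GPU finishes, so the straggler dominates.

def step_time(shard_tokens, throughputs):
    """Latency of one step: max over GPUs of (shard size / GPU speed)."""
    return max(t / tp for t, tp in zip(shard_tokens, throughputs))

# Hypothetical cluster: three GPUs, one older card at half speed.
throughputs = [100.0, 100.0, 50.0]  # tokens/ms per GPU (invented numbers)
total_kv = 120_000                  # tokens of KV cache to shard

# Even split: implicitly assumes hardware homogeneity.
even = [total_kv // 3] * 3

# Proportional split: shard sizes follow GPU speed, so all finish together.
total_tp = sum(throughputs)
prop = [int(total_kv * tp / total_tp) for tp in throughputs]

print(step_time(even, throughputs))  # slow GPU dominates: 800.0 ms
print(step_time(prop, throughputs))  # balanced finish times: 480.0 ms
```

Under these assumed numbers, the even split is bottlenecked by the half-speed GPU at 800 ms per step, while a speed-proportional split brings every GPU to the finish line at 480 ms.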
