Why Your PyTorch Training Crawls on a Beefy GPU (And How to Fix It)

Covers Dao-AILab/flash-attention. Discussed on DEV.

Last month I was helping a friend debug a training loop that was running at maybe 15% GPU utilization on an A100. Fifteen percent. On a card that costs more than my first car. He'd already tried bumping the batch size, swapping the optimizer, and rewriting the data loader — nothing moved the needle. This is one of those frustrating problems where the obvious knobs do nothing, because the obvious knobs aren't where the bottleneck lives. So let's actually walk through how to figure out why your...
