What exactly is Flash Attention?
The IO-aware trick that lets attention skip the giant N x N score matrix entirely, making memory linear in sequence length and the kernel…
Read the original article