Loop Unrolling in the ML Era (opens in new tab)
If you have a massive compute architecture—whether it’s a modern wide-SIMD vector engine, a Tensor Core array, or a custom deep learning accelerator like a Systolic Array—you face one fundamental problem: feeding the beast. You have immense execution width, but if your instructions are bottlenecked by branch overhead and short basic blocks, those execution units sit idle.
Read the original article