Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels
Introduction

Modern ML workloads depend heavily on custom GPU kernels. Even when a model is expressed as clean tensor operations, the performance almost a...