Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels

Discussed on Lobsters

Introduction

Modern ML workloads depend heavily on custom GPU kernels. Even when a model is expressed as clean tensor operations, the performance almost a...
