My First Multi-GPU Kernel: Writing All-to-All for AMD MI300X
gau-nernst.github.io

Nov 2, 2025

Last month, I participated in the AMD Distributed Challenge, hosted by GPU MODE. This was very exciting for me, as it was the first time I learned how to write a multi-GPU kernel! Although I had a rough understanding of how DDP and FSDP work under the hood via collective primitives like all-reduce and reduce-scatter, I didn't know it was possible to perform remote memory access directly inside a kernel! This opens up a lot of opportunities for multi-GPU optimizations, especially overlapping compute with inter-GPU communication.

This blog post is structured as my worklog on the first problem - the All-to-All kernel. You can see the full problem description, including the reference kernel, at [gpu-mode/reference-kernels](https://github.com/gpu-…
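Before diving into the worklog, it may help to recall what all-to-all actually computes. The following is a minimal single-process sketch of the *semantics* only (plain Python, no GPUs, not the MI300X kernel): with `W` ranks, rank `i`'s `j`-th output chunk is rank `j`'s `i`-th input chunk, i.e. a transpose of the chunk grid across ranks.

```python
def all_to_all(inputs):
    """Simulate all-to-all across W ranks in one process.

    inputs[i] is rank i's list of W chunks (one chunk destined for
    each peer). Returns outputs, where outputs[dst][src] is the chunk
    that rank src sent to rank dst.
    """
    world_size = len(inputs)
    # Output chunk grid is the transpose of the input chunk grid.
    return [[inputs[src][dst] for src in range(world_size)]
            for dst in range(world_size)]

# Example with 3 ranks and one labeled chunk per (src, dst) pair:
inputs = [["a0", "a1", "a2"],   # rank 0 sends a0->rank0, a1->rank1, a2->rank2
          ["b0", "b1", "b2"],   # rank 1
          ["c0", "c1", "c2"]]   # rank 2
outputs = all_to_all(inputs)
# Rank 0 now holds ["a0", "b0", "c0"]: the chunk each peer addressed to it.
```

On real hardware the interesting part is how those chunk moves are scheduled over the inter-GPU links; the semantics above are what any correct implementation must preserve.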
