ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves fewer than a third of them, but a few of the generated kernels outperform every public implementation.