🖥️ GPU Computing · morphllm.com · Blog · Tutorial

Optimizing Models to Be Fast at Codegen

Covers KernelBench: Can LLMs Write Efficient GPU Kernels? Discussed on Hacker News.

Three places the open inference stack quits, and what we build past each. A speculator trained on the model's own diffs: a generic draft gets 1.93x, a trained one 3.07x. An autoresearch loop for kernels on $7K GPUs: 97 to 162 tok/s. A prefix cache that crosses NVLink-denied boxes over plain TCP.
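To put the quoted figures on a common footing, here is a minimal sketch of the relative gains they imply. It assumes the 1.93x and 3.07x speedups are measured against the same non-speculative baseline, which the summary does not state; the function and variable names are illustrative only.

```python
def relative_gain(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before

# Kernel autoresearch loop: throughput figures quoted in the summary (tok/s).
kernel_gain = relative_gain(97, 162)

# Trained speculator vs. generic draft, ASSUMING both speedups share one baseline.
draft_gain = relative_gain(1.93, 3.07)

print(f"kernel loop: {kernel_gain:.2f}x")
print(f"trained vs. generic draft: {draft_gain:.2f}x")
```

Under that assumption, the kernel loop is worth roughly 1.67x in throughput, and training the speculator adds roughly 1.59x over a generic draft.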
