The no BS Guide to implementing reasoning models from scratch with SFT & RL
Image By Author
In Part 1 of this series, we laid the groundwork for understanding how reasoning large language models (LLMs) can be built from first principles using PyTorch. We explored core transformer architecture enhancements such as Multi-Query Attention (MQA), Grouped Query Attention (GQA), and Mixture of Experts (MoE), and implemented them from scratch.
These architectural upgrades have no doubt brought capability and efficiency gains far beyond traditional transformer models, but on their own they do not induce reasoning. To make LLMs useful, aligned, and capable of real-world reasoning, there is usually a post-training pipeline that shapes model behavior well beyond generic pre-training. Although the exact order varies across models and companies, the key stages are supervised fine-tuning (SFT) and reinforcement learning (RL).
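To make the SFT stage concrete before we dig in, here is a minimal sketch of a single supervised fine-tuning step in PyTorch. The tiny embedding-plus-linear model, the vocabulary size, and the hard-coded token IDs are placeholders for illustration only (not the model from Part 1); the part that matters is that the loss is computed only on the response tokens, with the prompt positions masked out.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32

# Stand-in for a causal language model: embedding followed by an output head.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One tokenized (prompt, response) pair; the IDs are made up for illustration.
# SFT computes the loss only on response tokens, so prompt positions in the
# label tensor are masked with ignore_index (-100).
tokens = torch.tensor([[5, 17, 42, 8, 23, 99]])         # prompt: 5 17 42 | response: 8 23 99
labels = torch.tensor([[-100, -100, -100, 8, 23, 99]])

logits = model(tokens)                                   # (batch, seq_len, vocab_size)

# Shift so that the logits at position t are trained to predict token t + 1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss: {loss.item():.4f}")
```

In a real pipeline the batch would come from a dataset of curated instruction-response pairs, but the masked next-token cross-entropy shown here is the core of the SFT objective.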
In this installment, we’ll walk through how to implement both SFT and RL from the ground up, showing not just what these techniques do, but how they work under the hood. By the end, you’ll understand the mechanics of fine-tuning and learning through reward signals, without leaning on high-level libraries.
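As a preview of the reward-signal side, below is a minimal sketch of a REINFORCE-style policy-gradient update, the simplest form of learning from a reward. The toy model, the sampling loop, and the hard-coded scalar reward are illustrative assumptions rather than the exact method used later in the series; more sophisticated algorithms such as PPO and GRPO build on the same idea of scaling the log-probability of sampled tokens by a reward or advantage.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32

# Stand-in for the language model (the "policy") being trained.
policy = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

prompt = torch.tensor([[5, 17, 42]])                     # made-up prompt token IDs

# 1. Sample a short response from the current policy.
generated = prompt
with torch.no_grad():
    for _ in range(4):
        next_logits = policy(generated)[:, -1]           # logits for the next token
        next_token = torch.multinomial(F.softmax(next_logits, dim=-1), num_samples=1)
        generated = torch.cat([generated, next_token], dim=-1)

# 2. Score the full response with a reward function, e.g. 1.0 if the final
#    answer is correct and 0.0 otherwise. Here it is a hard-coded placeholder.
reward = 1.0

# 3. REINFORCE update: increase the log-probability of the sampled response
#    tokens in proportion to the reward.
logits = policy(generated)[:, :-1]                       # position t predicts token t + 1
log_probs = F.log_softmax(logits, dim=-1)
chosen = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
response_log_prob = chosen[:, prompt.shape[1] - 1:].sum()  # response positions only
loss = -reward * response_log_prob
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Keep this skeleton in mind as we go: the rest of the article fills in the pieces this sketch glosses over, such as where the reward comes from and how to keep the updated policy from drifting too far from its starting point.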