RLHF in 2026: when to pick PPO, DPO, or verifier-based RL
The famous InstructGPT result is still the cleanest argument for post-training: labelers preferred outputs from a 1.3B aligned model over those from the 175B GPT-3 base on instruction-following prompts, and the 175B aligned model won about 85% of the time against that same baseline. Alignment beat a 100x scale gap. That result got a lot of people to implement RLHF. Most of them later ripped it out and switched to DPO. A smaller group skipped both and went to verifier-based RL. This post is the decision tree I wish I'd had when I started: what each pipeline actually looks like in TRL, where i...