Reasoning RL in 2026: GRPO, DPO, RLVR, Agentic PO & Beyond

Covers 2 stories including DeepSeekMath

DeepSeekMath

Discussed on Hacker News

[2305.18290] Direct Preference Optimization: Your Language Model is Secretly a Reward Model