This is the fourth and final piece in my series on reinforcement learning. Previously, we covered classical RL, continuous control, and off-policy methods. The topic of LLM post-training is discussed all over X, so this primer should help anyone get up to speed.
Here’s how I like to think about post-training methodologies:
SFT is simple. It’s just applying additional training iterations like the pre-training stage, but on a curated set of ideal (prompt, response) pairs. You might make this more efficient with a LoRA adapter.
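For concreteness, here's a minimal sketch of the SFT step, assuming a Hugging Face-style causal LM whose forward pass exposes `.logits`, and a batch where the prompt tokens are masked out of the labels:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Next-token cross-entropy on the curated (prompt, response) pairs.

    `labels` equals `input_ids` on response tokens and -100 on prompt/padding
    positions, so only the ideal response contributes to the loss.
    """
    logits = model(input_ids).logits        # (batch, seq, vocab)
    logits = logits[:, :-1, :]              # position t predicts token t+1
    targets = labels[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,                  # skip masked positions
    )
```

A LoRA adapter doesn't change this loss at all; it only restricts which parameters receive gradients.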
In this post, we’ll focus on quadrant 2: DPO and offline GRPO. Along the way, I’ll point out how methods like online PPO and online GRPO fit in. Historically, online PPO came first, so understanding it helps explain DPO.
Before getting into the objective function of direct preference optimization (DPO), we need to motivate the idea of relative scoring.
We’re given lists of prompts x and pairwise responses a+ and a-:
((x, a_{+}, a_{-}))
All we know is that a+ is preferred to a-. A human may have rated them that way, or another signal might imply it (e.g. code that compiles > code that fails).
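Concretely, the dataset is nothing more than a list of triples (the example content here is illustrative):

```python
preference_data = [
    # (prompt x, preferred response a+, rejected response a-)
    ("Write a Python function that reverses a string.",
     "def rev(s):\n    return s[::-1]",       # a+: runs and is correct
     "def rev(s):\n    return s.reverse()"),  # a-: str has no .reverse()
]
```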
That setup doesn’t immediately lend itself to the methods we’ve seen so far. On-policy methods don’t work because neither response may be likely under the current policy. Off-policy methods still don’t work because we lack a defined reward.
You might try to make the model more likely to output a+ than a- by optimizing Pr[π(a+ | x) > π(a- | x)]. But that expression makes no sense. π(a | x) are constants given the model. Unless we add tunable parameters (say, some γ where fγ(π) produces a new model), those probabilities don’t change.
You could define fγ(π)(x, a) and optimize Pr[fγ(π)(x, a+) > fγ(π)(x, a-)], or build an even more general model fγ(π, x, a+, a-) that directly outputs the likelihood that a+ is better. But f is still abstract. It’s unclear how to parameterize it.
Instead, DPO (and the original online PPO post-training) take a simpler route by introducing a latent reward. The assumption is that if a human preferred a+ to a-, then there exists some implicit reward function r such that
(r(a_{+} | x) + ε_{+} > r(a_{-} | x) + ε_{-})
where ε represents human noise or ambiguity. If we can learn that reward function, we can optimize the model accordingly.
One approach is maximum likelihood estimation. We denote a+ ≻ a- if a+ is preferred. We’d like a function g such that:
(g(r_\phi(a_{+} | x), r_\phi(a_{-} | x)) = Pr[a_{+} ≻ a_{-}])
and then optimize φ to maximize:
(\ \prod Pr[a_{+} ≻ a_{-}]\ )
Let’s try to define g. Notice:
(\begin{align*} Pr[a_{+} ≻ a_{-}] &= Pr[r(a_{+} | x) + ε_{+} > r(a_{-} | x) + ε_{-}] \\ &= Pr[r(a_{+} | x) − r(a_{-} | x) > ε_{-} − ε_{+}] \end{align*})
So preference depends only on the difference between rewards. That implies translational invariance: g(u, v) = g(u + c, v + c). This forces g to be a function of the difference alone, since g(u, v) = g(u − v, 0) =: f(u − v), where the first equality follows from translational invariance with c = −v. So g must be f(r(a+ | x) − r(a- | x)) for some function f.
Second, if r(a_{++} | x) > r(a_{+} | x) > r(a_{-} | x), the higher-reward response should never be less preferred. In other words, f must be non-decreasing: f′(t) ≥ 0.
Finally, f(r(a+ | x) − r(a- | x)) + f(r(a- | x) − r(a+ | x)) = 1, i.e. f(t) + f(−t) = 1, which implies f(0) = ½. Together with monotonicity and the natural requirement that an overwhelming reward gap make the preference certain, we also want lim t→∞ f(t) = 1 and lim t→−∞ f(t) = 0.
Many functions f satisfy these conditions. The choice depends on what noise distribution you assume. In practice, DPO uses the logistic sigmoid, which assumes Gumbel noise:
(\begin{aligned} σ(x) &:= \frac{1}{1 + e^{-x}} \\ U(a) &= r(a) + ε,\quad ε ∼ \text{Gumbel}(0, 1) \\ \implies Pr[U(a_{+}) > U(a_{-})] &= σ(r(a_{+}) − r(a_{-})) \end{aligned})
If noise were Gaussian, you’d recover the probit model instead.
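For intuition, a quick numeric check of the logistic choice: a reward gap of 2 corresponds to about an 88% chance that the higher-reward response wins, and a gap of 0 to a coin flip.

```python
import math

def preference_prob(r_plus, r_minus):
    # Bradley-Terry / logistic model: Pr[a+ ≻ a-] = σ(r(a+) − r(a-))
    return 1.0 / (1.0 + math.exp(-(r_plus - r_minus)))

print(preference_prob(3.0, 1.0))  # ≈ 0.88
print(preference_prob(1.0, 1.0))  # 0.5
```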
The final objective to optimize rφ is:
(\begin{align*} \max J(\phi) &= \sum \log \sigma(r_{\phi}(a_{+}) − r_{\phi}(a_{-})) \\[4pt] \nabla J(\phi) &= \sum \big(1 − \sigma(r_{\phi}(a_{+}) − r_{\phi}(a_{-}))\big) \left[\nabla r_{\phi}(a_{+}) − \nabla r_{\phi}(a_{-})\right] \end{align*})
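In code this is just logistic regression on score differences. A minimal sketch, assuming the scalar scores r_φ(a+ | x) and r_φ(a- | x) have already been computed by whatever reward model you're training:

```python
import torch.nn.functional as F

def reward_model_loss(r_plus, r_minus):
    """Negative log-likelihood of the observed preferences.

    r_plus, r_minus: shape (batch,) tensors of r_phi(a+ | x) and r_phi(a- | x).
    Minimizing -log σ(r+ − r−) is the same as maximizing J(φ) above.
    """
    return -F.logsigmoid(r_plus - r_minus).mean()
```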
Now we have a reward function. Just as in REINFORCE, you can optimize your policy with respect to the objective:
(\max_\theta J(\theta) = E_{\pi_{\theta}}[r(x, a)])
It’s a bit different from REINFORCE since there’s no discounted sum of rewards across a trajectory; we’re optimizing a single-step reward. The problem with this approach is that, left unconstrained, it will completely reshape your model: the optimization will force the policy to output a+ with very high probability, at the cost of everything else.
So the optimization used for online PPO and DPO adds a constraint to prevent the policy from diverging too much from the original policy, πref:
( \max_θ E_{\pi_{\theta}}[r(x, a)] \quad \text{s.t. } KL(\pi_\theta \,\|\, \pi_\text{ref}) < δ)
That KL divergence constraint might make you think of PPO. But that similarity is completely superficial. Recall that the KL divergence constraint for PPO came from rewriting the objective function:
(\begin{align*} \max_\theta J(θ) &= E_\tau[g(\tau)] = E_{x∼d^{π_{\text{old}}},a∼π_{\text{new}}}[A^{π_{\text{old}}}(x, a)] \end{align*})
We needed to constrain dπ_old ≈ dπ_new so we didn’t have to re-sample, and the best we could do was penalize KL(πnew || πold) and establish an upper bound on the divergence of the state distributions.
The KL divergence constraint for online PPO and DPO is not fundamentally justified in the same way. It is simply the heuristic notion that we want πnew to be not too different from πref. You could theoretically derive this constraint if you think the true model follows a Boltzmann distribution, and you impose πref as a prior. This leads to the same objective function as above. But that’s not really where this KL divergence constraint comes from.
If you’re running online PPO, you’ll see the KL divergence penalty in the objective function to keep the policy πnew close to πref, and a clipping mechanism to keep the policy πnew within the trust region of πold.
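A rough sketch of that combination, with everything except the policy term stripped away (token-level details and advantage estimation omitted; `logp_new`, `logp_old`, `logp_ref`, `adv`, and `kl_coef` are assumed precomputed per-sample quantities and a hyperparameter, respectively):

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, logp_ref, adv, clip_eps=0.2, kl_coef=0.1):
    # Clipping keeps pi_new inside the trust region of pi_old ...
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # ... while the KL penalty (a crude Monte Carlo estimate here) keeps
    # pi_new close to pi_ref.
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```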
To finish the derivation, we move to the Lagrangian: the KL constraint becomes a penalty with coefficient 1/β, and a multiplier λ enforces that πθ(a | x) sums to 1:
(\begin{align*} J(θ) &= \mathbb{E}_{\pi_\theta}[r(x, a)] - \frac{1}{β}\, KL(\pi_\theta \,\|\, \pi_\text{ref}) - λ \left[\sum_a \pi_\theta(a | x) - 1\right] \\[6pt] &= \sum_a \pi_\theta(a | x) \left[r(x, a) - \tfrac{1}{β}\big(\log \pi_\theta(a | x) - \log \pi_\text{ref}(a | x)\big) - λ\right] + λ \end{align*})
Taking the gradient:
(\begin{align*} \nabla J(θ) &= \sum_a \nabla \pi_\theta(a | x) \left[r(x, a) - \tfrac{1}{β}\big(\log \pi_\theta(a | x) - \log \pi_\text{ref}(a | x)\big) - λ\right] \\[6pt] &\quad - \tfrac{1}{β} \sum_a \pi_\theta(a | x) \, \nabla \log \pi_\theta(a | x) \end{align*})
You can go ahead and optimize πθ using this gradient, and that’s exactly where methods like online PPO (or as we cover later, online GRPO) fit in.
DPO, on the other hand, attempts to turn this into a supervised learning problem, eliminating the need for rollouts or trajectories altogether. DPO starts by solving for the closed-form solution of πθ. Note that πθ(a | x) ∇log πθ(a | x) = ∇πθ(a | x) by the log-gradient trick, so the second summation can be folded into the first as an extra −1/β inside the bracket. Assuming the policy class is expressive enough, the optimum is reached when that bracket vanishes for every a:
( r(x,a) - \tfrac{1}{β}\big(\log π_θ(a|x) - \log π_{\text{ref}}(a|x)\big) - λ - \tfrac{1}{β} = 0)
Multiplying through by β,
(\beta r(x,a) - \big[\log π_θ(a|x) - \log π_{\text{ref}}(a|x)\big] - βλ - 1 = 0)
so:
( \log π_θ(a|x) = \beta r(x,a) + \log π_{\text{ref}}(a|x) - βλ - 1)
and exponentiating gives:
(π_θ(a|x) = \frac{π_{\text{ref}}(a|x) e^{\beta r(x,a)}}{C(x)})
where C(x) = e^{βλ + 1} is the normalization constant ensuring probabilities sum to one (the multiplier λ is exactly what makes this hold).
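Here's a toy illustration of that closed form over a three-response "vocabulary" (β, πref, and the rewards are made up):

```python
import numpy as np

beta = 2.0
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over 3 responses
r = np.array([1.0, 0.0, -1.0])       # latent rewards for those responses

unnormalized = pi_ref * np.exp(beta * r)
C = unnormalized.sum()               # C(x): normalization constant
pi_star = unnormalized / C

print(pi_star)        # mass shifts toward the high-reward response...
print(pi_star.sum())  # ...while still summing to 1
```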
Then, DPO moves in the reverse direction, substituting this definition of πθ to express r:
(\ r(x, a)=\frac{1}{β}[ \log C(x)+\log π_θ(a | x)−\log π_\text{ref}(a | x) ])
Previously we derived this expression, which we'll use for MLE:
(Pr[U(a_{+}) > U(a_{-})] = σ(r(a_{+}) − r(a_{-})))
Plugging in r:
(\begin{align*} Pr[U(a_{+}) > U(a_{-})] &= \sigma\Big(\frac{1}{\beta}[(\log \pi_{\theta}(a_{+} | x) − \log \pi_{\text{ref}}(a_{+} | x)) − (\log \pi_{\theta}(a_{-} | x) − \log \pi_\text{ref}(a_{-} | x))]\Big) \end{align*})
Then we maximize likelihood:
( \begin{align*} J(θ) &= ∑ \log \sigma\Big(\frac{1}{\beta}[(\log \pi_{\theta}(a_{+} | x) − \log \pi_\text{ref}(a_{+} | x)) − (\log \pi_{\theta}(a_{-} | x) − \log \pi_\text{ref}(a_{-} | x))]\Big) \end{align*})
That’s the final DPO objective. It can be optimized via standard supervised learning on your dataset. Choosing β controls the trade-off between imitation and divergence. But note that you no longer get additional signal beyond the dataset, unlike online PPO.
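A minimal sketch of that objective, assuming the sequence-level log-probabilities (sums of token log-probs) under πθ and πref have already been computed, and following this post's convention of putting 1/β inside the sigmoid (the original DPO paper multiplies by β instead):

```python
import torch.nn.functional as F

def dpo_loss(logp_theta_plus, logp_theta_minus,
             logp_ref_plus, logp_ref_minus, beta=1.0):
    """DPO loss for a batch of (a+, a-) pairs; each argument has shape (batch,)."""
    margin_plus = logp_theta_plus - logp_ref_plus      # log πθ(a+|x) − log πref(a+|x)
    margin_minus = logp_theta_minus - logp_ref_minus   # log πθ(a-|x) − log πref(a-|x)
    return -F.logsigmoid((margin_plus - margin_minus) / beta).mean()
```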
Now we reach the modern variant. GRPO was introduced by DeepSeek in 2024.
GRPO begins with the same pairwise setup as DPO. In fact, pairwise GRPO is mathematically identical to DPO, just rewritten.
To simplify notation, define:
(z(x):=\frac{1}{β}[(\log \pi_{\theta}(a_{+} | x)−\log \pi_\text{ref}(a_{+} | x))−(\log \pi_{\theta}(a_{-} | x)−\log \pi_\text{ref}(a_{-} | x))])
Then the objective function for DPO becomes:
(\begin{align*} J(θ) &= \sum \log σ(z(x)) \\[4pt] \nabla J(θ) &= \sum \big(1 - σ(z(x))\big) \, \nabla z(x) \\[4pt] \nabla z(x) &= \frac{1}{β}\left[\nabla \log \pi_{\theta}(a_{+} \mid x) - \nabla \log \pi_{\theta}(a_{-} \mid x)\right] \end{align*})
Define a shorthand:
(w(x,a_{+},a_{-}):=\frac{1−σ(z(x))}{β})
Then:
(∇J(θ)=∑ w(x,a_{+},a_{-})[∇ \log \pi_\theta(a_{+} | x) − ∇ \log \pi_\theta(a_{-} | x)])
Next, GRPO defines a synthetic reward function R̂:
(\hat{R}(a) := \begin{cases} +w(x, a_{+}, a_{-}), & a = a_{+} \\[4pt] -w(x, a_{+}, a_{-}), & a = a_{-} \\[4pt] 0, & \text{otherwise} \end{cases})
Then we might rewrite the gradient as:
(∇J(θ)=∑ \hat{R}(a) ∇ \log π_θ(a | x))
This is exactly the REINFORCE policy gradient! This is the basic formulation for offline GRPO in the pairwise case. As you can see, all we did was make a few substitutions; we didn't fundamentally change the optimization. Thus, offline GRPO (pairwise) ≡ DPO ≡ REINFORCE in disguise.
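The same update written in this REINFORCE form looks roughly like the sketch below: the weight w is computed exactly as above and then detached, so it acts as a fixed reward and gradients flow only through the log πθ terms.

```python
import torch

def pairwise_offline_grpo_loss(logp_theta_plus, logp_theta_minus,
                               logp_ref_plus, logp_ref_minus, beta=1.0):
    # z(x) and w(x, a+, a-) exactly as defined above
    z = ((logp_theta_plus - logp_ref_plus)
         - (logp_theta_minus - logp_ref_minus)) / beta
    w = ((1 - torch.sigmoid(z)) / beta).detach()   # R-hat is a reward, not differentiated
    # Surrogate whose gradient is  sum_a R_hat(a) ∇ log πθ(a | x)
    return -(w * logp_theta_plus - w * logp_theta_minus).mean()
```

Its gradient matches the DPO gradient term for term.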
So that’s the pairwise case. But the “group” in “group relative policy optimization” implies that you can have more than two responses. To be clear, with a1 > a2 > a3, you could decompose that into pairs (a1 > a2, a2 > a3, …), but GRPO treats the group as a first-class citizen.
Here’s where the theory gets shaky. In DPO, the weights are strictly determined as ±(1 − σ(z))/β. GRPO merely observes that these weights satisfy ∑ wᵢ = 0 and generalizes: any set of scores with ∑ wᵢ = 0 is allowed.
The same supervised objective then applies:
(∇J(θ)=∑ \hat{R}(a_{i}) ∇ \log π_θ(a_{i} | x))
where R̂ uses the custom group weights.
To adapt this to online GRPO, we reuse the same idea as online PPO. After generating k responses for a prompt, compute their scores, center them by subtracting the group mean (the published GRPO also divides by the group standard deviation), and treat those as the rewards.
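A sketch of that group-relative scoring step (the response generation and scoring themselves are elided):

```python
import torch

def group_relative_rewards(scores, normalize_std=True, eps=1e-6):
    """Turn raw scores for k responses to one prompt into group-relative rewards:
    subtract the group mean and, as in the published GRPO, divide by the std."""
    scores = torch.as_tensor(scores, dtype=torch.float32)
    adv = scores - scores.mean()
    if normalize_std:
        adv = adv / (scores.std() + eps)
    return adv  # sums to (roughly) zero, as the group weights must

print(group_relative_rewards([1.0, 0.0, 0.0, 2.0]))
```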
So is GRPO broken? Many people report that it works for them empirically. But it’s fair to say that GRPO’s theoretical foundations are weaker than many other methods. I’ll end this with a take I posted about GRPO:
I hope to cover other RL/ML topics in future posts, but that concludes my blog series on reinforcement learning. Feedback is appreciated!