Diffusion Policy Optimization without Drifting Apart

Covered by DEV Community

RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double-drift phenomenon: optimizing a variational surrogate can let the ELBO separate from the true log-likelihood, which then makes the resulting proxy policy gradient misaligned with the true policy gradient of expected r...
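The separation between the ELBO and the true log-likelihood can be made concrete with the standard variational identity. What follows is a sketch in assumed notation, not the paper's own: $s$ is the state, $a_0$ the action, $a_{1:T}$ the noised latents, $q$ a fixed forward noising process, $p_\theta$ the learned reverse process, and $\pi_\theta$ the resulting policy.

```latex
% Exact decomposition of the policy log-likelihood (standard identity)
\log \pi_\theta(a_0 \mid s)
  = \underbrace{\mathbb{E}_{q(a_{1:T} \mid a_0)}\!\left[
      \log \frac{p_\theta(a_{0:T} \mid s)}{q(a_{1:T} \mid a_0)}
    \right]}_{\mathrm{ELBO}_\theta(a_0, s)}
  + \underbrace{\mathrm{KL}\!\left(
      q(a_{1:T} \mid a_0) \,\|\, p_\theta(a_{1:T} \mid a_0, s)
    \right)}_{\text{gap} \;\ge\; 0}

% Differentiating both sides: the ELBO gradient omits the KL term
\nabla_\theta \log \pi_\theta(a_0 \mid s)
  = \nabla_\theta \mathrm{ELBO}_\theta(a_0, s)
  + \nabla_\theta \mathrm{KL}\!\left(
      q(a_{1:T} \mid a_0) \,\|\, p_\theta(a_{1:T} \mid a_0, s)
    \right)
```

Under these assumptions, a policy gradient built from $\nabla_\theta \mathrm{ELBO}_\theta$ differs from one built from $\nabla_\theta \log \pi_\theta$ by the gradient of the KL gap, so the two can point in different directions whenever that gap changes with $\theta$.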



FutureX · Physical AI Daily — Issue 29 (06/16)
