The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL (opens in new tab)
Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under trai...
Read the original article