Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models
arxiv.org·1d


Abstract: Group Relative Policy Optimization (GRPO) is a powerful technique for aligning generative models, but its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs. In this work, we investigate this trade-off through empirical studies, yielding two key observations. First, we discover the reward clustering phenomenon, in which many trajectories collapse toward the group-mean reward and offer limited optimization value. Second, we design a heuristic strategy named Optimal Variance Filtering (OVF) and verify that a high-variance subset of trajectories selected by OVF can outperform the larger, unfiltered group. However…
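The two observations can be illustrated with a small sketch. The abstract does not specify how OVF selects its subset, so the code below is an assumption and not the authors' algorithm: `group_relative_advantages` applies the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation), and the hypothetical helper `select_high_variance_subset` keeps the `k` rewards with the largest variance by trying every split between the lowest and highest rewards. The function names, the `eps` constant, and the example rewards are all illustrative.

```python
from statistics import mean, pstdev, pvariance


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero for identical rewards.
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_high_variance_subset(rewards, k):
    """Return sorted indices of a size-k subset with high reward variance.

    Assumption: candidates are limited to the n_low lowest plus the
    (k - n_low) highest rewards, for n_low = 0..k. This is a stand-in
    for OVF, whose exact procedure the abstract does not give.
    """
    n = len(rewards)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(rewards)")
    order = sorted(range(n), key=lambda i: rewards[i])
    best_var, best_idx = -1.0, None
    for n_low in range(k + 1):
        n_high = k - n_low
        # order[n - n_high:] is empty when n_high == 0, as intended.
        idx = order[:n_low] + order[n - n_high:]
        var = pvariance([rewards[i] for i in idx])
        if var > best_var:
            best_var, best_idx = var, idx
    return sorted(best_idx)


# Made-up rewards showing clustering: most values sit near the mean of 0.5.
rewards = [0.1, 0.5, 0.5, 0.5, 0.52, 0.48, 0.9, 0.5]
kept = select_high_variance_subset(rewards, k=4)
kept_rewards = [rewards[i] for i in kept]
advantages = group_relative_advantages(kept_rewards)
```

In this toy group, the trajectories clustered at 0.5 receive advantages of zero under the normalization, so dropping them loses little signal while halving the number of trajectories to backpropagate through. Whether that holds in practice is the empirical question the paper studies.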
