On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
arxiv.org·2d

Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spira…
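
To make the two notions in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the commonly used GRPO advantage (each reward standardized against its own sampled group, which is what makes the method value-free) and reads LLD simply as a response's log-likelihood failing to rise after an update. The function names, the `eps` stabilizer, and the `tol` threshold are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (no value network).

    Assumption: the common GRPO form (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def likelihood_displacement(logp_before, logp_after):
    """Per-response change in log-likelihood across one policy update."""
    return [after - before for before, after in zip(logp_before, logp_after)]


def is_lazy(displacements, tol=0.0):
    """Flag responses whose likelihood fell or stagnated.

    Assumption: 'lazy' means displacement <= tol; tol is illustrative.
    """
    return [d <= tol for d in displacements]


# Toy group of four rollouts with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical sequence log-likelihoods before and after an update.
displacements = likelihood_displacement([-10.0, -12.0], [-10.5, -11.0])
lazy_flags = is_lazy(displacements)
```

In this toy run the first response's log-likelihood drops by 0.5 and is flagged, while the second rises by 1.0 and is not. Under the abstract's description, the concerning case would be flags raised on correct and incorrect responses alike.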
