Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
arxiv.org·1d

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically conduct training across all generated tokens, but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the…
