Trust-Region Adaptive Policy Optimization
arxiv.org·5d

Title: Trust-Region Adaptive Policy Optimization

Abstract: Post-training methods, especially Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), play an important role in improving large language models’ (LLMs) complex reasoning abilities. However, the dominant two-stage pipeline (SFT then RL) suffers from a key inconsistency: SFT enforces rigid imitation that suppresses exploration and induces forgetting, limiting RL’s potential for improvements. We address this inefficiency with TRAPO (Trust-Region Adaptive Policy Optimization), a hybrid framework that interleaves SFT and RL within each training instanc…