Double Horizon Model-Based Policy Optimization
arxiv.org·1d

Abstract: Model-based reinforcement learning (MBRL) reduces the cost of real-environment sampling by generating synthetic trajectories (called rollouts) from a learned dynamics model. However, choosing the length of the rollouts poses two dilemmas: (1) Longer rollouts better preserve on-policy training but amplify model bias, indicating the need for an intermediate horizon to mitigate distribution shift (i.e., the gap between on-policy and past off-policy samples). (2) Moreover, a longer model rollout may reduce value estimation bias but raise the variance of policy gradients due to backpropagation through multipl…
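To make the model-bias side of the first dilemma concrete, here is a minimal toy sketch in Python. It is not the paper's method: the 1-D linear dynamics, the coefficients, and the names `rollout`, `true_step`, and `model_step` are all assumptions chosen for illustration. It rolls out a deliberately misspecified "learned" model alongside the true dynamics and measures how far the synthetic trajectory drifts as the horizon grows.

```python
import numpy as np


def rollout(step_fn, state, horizon):
    """Apply step_fn `horizon` times; return all visited states, start included."""
    states = [state]
    for _ in range(horizon):
        state = step_fn(state)
        states.append(state)
    return np.array(states)


def true_step(s):
    # Hypothetical "real environment" dynamics (assumption for this toy).
    return 0.9 * s + 1.0


def model_step(s):
    # Hypothetical learned model with a small error in the coefficient.
    return 0.95 * s + 1.0


horizon = 20
real = rollout(true_step, 1.0, horizon)
synthetic = rollout(model_step, 1.0, horizon)

# Absolute gap between the synthetic and real state at each rollout step.
errors = np.abs(synthetic - real)

for t in (1, 5, 10, 20):
    print(f"step {t:2d}: |model - real| = {errors[t]:.3f}")
```

In this toy setup the gap grows at every step, because each model prediction is fed back in as the next input. That compounding is the reason a long rollout horizon is costly when the model is imperfect.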
