Scaling Self-Play with Self-Guidance: An AlphaZero-Style Path for Language Models
1. From Pretraining to Long-Horizon Reinforcement Learning