Banger paper from Meta and collaborators.
This paper is one of the best deep dives yet on how reinforcement learning (RL) actually scales for LLMs.
The team ran over 400,000 GPU hours of experiments to find a predictable scaling pattern and a stable recipe (ScaleRL) that consistently works as you scale up compute.
Think of it as a practical guide for anyone trying to train reasoning or alignment models with RL.
More on why this is a big deal:
1. The big insight: RL progress follows a predictable curve.
When you plot model performance vs compute, the growth isn’t random; it follows a sigmoid (S-shaped) curve.
The curve has three simple knobs: A = the best performance you’ll ever reach, B = how efficiently you reach it, C_mid = how much compute it takes to hit the halfway point.
The amazing part: you can fit this curve using just small runs and accurately predict how a 100k GPU-hour run will behave.
So you no longer need to guess; you can forecast where your RL setup will top out before burning compute.
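
To make that concrete, here is a minimal sketch of fitting such a curve and extrapolating, assuming a saturating form R(C) = R_0 + (A - R_0) / (1 + (C_mid / C)^B); the data points and exact parameterization below are illustrative, not taken from the paper.

```python
# Sketch: fit a sigmoidal compute-vs-performance curve on small runs,
# then forecast a much larger run. Data and parameterization are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(C, A, B, C_mid, R0=0.0):
    """Reward rises from R0 toward ceiling A, with efficiency exponent B
    and midpoint compute C_mid."""
    return R0 + (A - R0) / (1.0 + (C_mid / C) ** B)

# Hypothetical small-run measurements: (GPU-hours, eval pass rate).
compute = np.array([500, 1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
reward = np.array([0.18, 0.24, 0.31, 0.38, 0.44, 0.49])

# Fit A (ceiling), B (efficiency), C_mid (halfway compute).
(A, B, C_mid), _ = curve_fit(
    sigmoid_scaling, compute, reward,
    p0=[0.6, 1.0, 5_000.0],
    bounds=([0.0, 0.1, 100.0], [1.0, 5.0, 1e6]),
)

print(f"Predicted ceiling A ~ {A:.2f}")
print(f"Forecast at 100k GPU-hours ~ {sigmoid_scaling(100_000.0, A, B, C_mid):.2f}")
```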
2. The ScaleRL recipe that just works.
The authors tested dozens of RL variations and found one that scales cleanly to 100k GPU hours without blowing up:
- PipelineRL (8 pipelines) with CISPO loss (a stabilized REINFORCE variant; a rough sketch of the loss follows below).
- Prompt-level averaging and batch-level normalization to reduce variance.
- FP32 logits for better stability and higher final accuracy.
- No-Positive-Resampling curriculum to avoid reward hacking.
- Forced interruptions (stopping long thoughts) instead of punishing long completions.

This combo, called ScaleRL, hit the best trade-off between stability, sample efficiency, and asymptotic performance.
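
For intuition on the loss side, here is a minimal, hedged sketch of a CISPO-style objective combined with prompt-level averaging and batch-level advantage normalization. Tensor shapes, names, and clipping bounds are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative CISPO-style token loss with prompt-level averaging and
# batch-level advantage normalization. Shapes and bounds are assumptions.
import torch

def cispo_loss(logp_new,        # [B, T] log-probs under the current policy
               logp_old,        # [B, T] log-probs under the generating policy
               advantages,      # [B]    one scalar advantage per completion
               mask,            # [B, T] float mask: 1 for generated tokens, 0 for padding
               clip_low=0.8, clip_high=1.2):
    # Batch-level normalization of advantages to reduce variance.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    adv = adv.unsqueeze(-1)  # broadcast over the token dimension

    # Clipped importance-sampling weight; gradient is stopped through the ratio,
    # so the update stays REINFORCE-like (the core CISPO idea).
    # (The recipe also recommends computing these log-probs from FP32 logits.)
    ratio = torch.exp(logp_new - logp_old)
    weight = torch.clamp(ratio, clip_low, clip_high).detach()

    # Token-level REINFORCE term, weighted by the clipped ratio and advantage.
    token_loss = -(weight * adv * logp_new) * mask

    # Prompt-level averaging: mean over each completion's tokens first,
    # then mean across the batch.
    per_prompt = token_loss.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)
    return per_prompt.mean()
```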