
Banger paper from Meta and collaborators.

This paper is one of the best deep dives yet on how reinforcement learning (RL) actually scales for LLMs.

The team ran over 400,000 GPU hours of experiments to find a predictable scaling pattern and a stable recipe (ScaleRL) that consistently works as you scale up compute.

Think of it as a practical guide for anyone trying to train reasoning or alignment models with RL.

More on why this is a big deal:

1. The big insight: RL progress follows a predictable curve.

When you plot model performance vs compute, the growth isn’t random; it follows a sigmoid (S-shaped) curve.

The curve has three simple knobs: A = the best performance you'll ever reach, B = how efficiently you reach it, and C_mid = how much compute it takes to hit the halfway point between where you start and that ceiling.
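Under those definitions, the curve can be sketched as a saturating sigmoid in compute. The exact functional form and the parameter values below are illustrative assumptions, not fitted numbers from the paper:

```python
def rl_scaling_curve(compute, A, B, C_mid, R0=0.0):
    """Illustrative sigmoid compute-performance curve.

    Performance rises from a starting level R0 toward the asymptote A
    as compute grows; B sets how steeply (efficiently) the gain is
    realized, and C_mid is the compute at which half of the total
    gain (A - R0) has been achieved.
    """
    return R0 + (A - R0) / (1.0 + (C_mid / compute) ** B)

# Hypothetical parameters for demonstration only:
A, B, C_mid = 0.80, 1.5, 10_000.0  # ceiling, efficiency, midpoint (GPU-hours)
for c in (1_000, 10_000, 100_000):
    print(f"{c:>7} GPU-hours -> performance {rl_scaling_curve(c, A, B, C_mid):.3f}")
```

Note that at `compute == C_mid` the curve sits exactly halfway between `R0` and `A`, which is what makes the midpoint knob easy to read off a fitted curve.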
