One battle after another: using RL-guided reasoning for next-token prediction
research.nvidia.com

**Published:** September 30, 2025

[Paper] [Code]

Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi

📌 Summary:

Reinforcement Learning Pretraining (RLP) brings reinforcement learning directly into the pretraining stage, rewarding models for generating useful chains-of-thought (CoT) that actually help predict future tokens. Unlike verifier-based methods, RLP is verifier-free, dense, and scalable, making “t…
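One way to read "dense, verifier-free reward for chains-of-thought that help predict future tokens" is as a log-likelihood improvement: the thought earns credit whenever conditioning on it raises the probability of the observed continuation relative to a no-thought baseline. The sketch below is a toy illustration of that reading, not the paper's implementation; the function name and the log-probability values are invented for the example.

```python
def cot_improvement_reward(logp_with_cot, logp_without_cot):
    """Per-token reward: how much the chain-of-thought improved the
    log-probability of each observed future token, versus scoring the
    same tokens without the thought (an assumption based on the summary
    above; not necessarily the paper's exact formulation)."""
    return [w - b for w, b in zip(logp_with_cot, logp_without_cot)]

# Toy per-token log-probs for the same ground-truth continuation,
# scored with and without a generated chain-of-thought.
with_cot = [-0.5, -1.2, -0.3]
without_cot = [-1.0, -1.5, -0.9]

rewards = cot_improvement_reward(with_cot, without_cot)
print(rewards)       # positive entries mean the thought helped
print(sum(rewards))  # total credit assigned to this thought
```

Because the signal is computed from next-token likelihoods alone, it needs no external verifier and yields a reward at every position, which matches the "verifier-free, dense" framing in the summary.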
