The uncomfortable question: What if RL isn’t teaching LLMs how to reason?
Reinforcement learning has become the default explanation for why large language models suddenly feel better at reasoning.
We fine-tune them with verifiers. We optimise against rewards. We celebrate new post-training pipelines like GRPO that push performance higher on math, code, and QA benchmarks.
And it works.
But there is an uncomfortable question lurking underneath all of this progress:
What if reinforcement learning isn’t actually teaching models how to reason?
What if it’s mostly doing something simpler, something we’ve been overlooking?
In our recent work, we took this possibility seriously. We asked whether the reasoning capabilities we attribute to RL might already exist inside base models, and whether the right inference-time strategy could unlock them without any additional training at all.
What we found surprised us.
A quiet observation people keep rediscovering
A growing body of evidence points to the same pattern: RL post-training often does not create entirely new reasoning behaviours.
Instead, it reshapes probability mass.
Intuitively, reinforcement learning doesn’t inject new thoughts into the model. It changes which thoughts are more likely to be expressed.
You can think of it like this:
RL doesn’t give the model new ideas — it makes it more confident about which ideas to say out loud.
This view explains several puzzling observations:
- Base models sometimes already contain correct multi-step solutions.
- RL-trained models often collapse diversity.
- In-domain gains don’t always transfer cleanly out of the domain.
Seen through this lens, RL looks less like “teaching reasoning” and more like distribution sharpening — amplifying trajectories that already exist inside the model.
If that’s true, an obvious follow-up question appears:
Can we target this sharpening directly, without training?
Why is low-temperature sampling not enough?
The first instinct is low-temperature sampling.
After all, lowering the temperature does sharpen distributions. It makes the model more confident. It reduces randomness. It often improves pass@1.
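To make the contrast concrete, here is a minimal, generic sketch of temperature decoding (an illustration, not code from our paper). Note that the division by `temperature` reshapes only the current step's next-token distribution; nothing here looks ahead at where the chosen token leads.

```python
import numpy as np

def temperature_sample(logits: np.ndarray, temperature: float = 0.7) -> int:
    """Sample one next token from temperature-scaled logits.

    Generic illustration of temperature decoding. The sharpening happens
    one step at a time: there is no notion of the futures a token opens up.
    """
    scaled = logits / temperature            # lower T => sharper distribution
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```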
But low temperature has a fundamental flaw:
It sharpens local confidence, not global quality.
At each step, the model becomes more certain about the next token — even if that token leads to a dead end later.
This is why low-temperature decoding often amplifies shortcuts:
- Guessing instead of planning
- Premature answers
- Locally plausible but globally poor trajectories
Check this example in our paper (https://www.arxiv.org/pdf/2601.21590, Section 4.1): we give an explicit case in which low-temperature sampling simply ends up guessing on a problem that requires planning.
In other words, low temperature rewards tokens that look good now, not tokens that lead to good futures.
And reasoning is fundamentally about the future.
Reasoning is a trajectory-level property
This brings us to a crucial shift in perspective.
Reasoning quality doesn’t live at the level of individual tokens.
It lives at the level of entire trajectories.
A good first step is one that sets up good later steps, even if it isn’t the most likely token locally.
This idea is formalised by what are known as power distributions. Instead of sampling tokens based on their immediate probability, power sampling reweights entire sequences, favouring trajectories that are globally more consistent and of higher quality.
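Concretely, the power distribution raises the model's sequence-level probability to a power α ≥ 1 and renormalises. In rough notation (mine, not necessarily the referenced paper's exact formulation):

```latex
% Power distribution over full sequences y = (y_1, ..., y_T); notation is illustrative.
\pi_\alpha(y) \;=\; \frac{p(y)^{\alpha}}{\sum_{y'} p(y')^{\alpha}},
\qquad
p(y) \;=\; \prod_{t=1}^{T} p(y_t \mid y_{<t}).
```

Raising α sharpens the distribution over whole trajectories rather than over individual next tokens, which is exactly the distinction drawn above.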
Empirically, this works remarkably well.
Sampling from power distributions can recover, and sometimes exceed, the gains of RL-trained models, as shown in this amazing paper: https://arxiv.org/pdf/2510.14901
An excellent paper by Karan and Du on how the power distribution can push LLMs to performance that rivals GRPO!
There’s just one problem.
The known way to sample from power distributions relies on MCMC.
And MCMC is slow.
Very slow.
So slow that it’s impractical for real deployment.
This raises the central question of our work:
Can we get the benefits of power sampling without paying the MCMC cost?
The key insight: global reasoning can be approximated locally
This is where the story changes.
The common belief has been that power sampling is inherently global: that you must reason over entire trajectories to do it correctly.
We show that this does not have to be the case.
Our key insight is simple to state:
The effect of global power sampling can be decomposed into a local correction.
At each token, the difference between standard low-temperature sampling and true power sampling is entirely captured by a scaling factor: a quantity that measures how good the future is if you choose this token now.
In other words:
- Low temperature asks: “How likely is this token?”
- Power sampling asks: “If I pick this token, how good are the futures it leads to?”
Crucially, this future-aware correction can be estimated using short Monte Carlo rollouts without iterative MCMC, without training, and without verifiers.
This transforms power sampling from a global, intractable procedure into something that can be done autoregressively, step by step.
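In equation form, the decomposition looks roughly like this (a sketch in my own notation, not the paper's exact statement):

```latex
% Conditional of the power distribution; V_alpha is the "future value" of a prefix.
\pi_\alpha(y_t \mid y_{<t})
\;\propto\;
p(y_t \mid y_{<t})^{\alpha}\, V_\alpha(y_{\le t}),
\qquad
V_\alpha(y_{\le t}) \;=\; \sum_{y_{>t}} p(y_{>t} \mid y_{\le t})^{\alpha}.
```

The first factor is exactly power-scaled (low-temperature-style) token sampling; the second factor, V_α, is the future-aware correction, and it is this term that short Monte Carlo rollouts can estimate.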
In other words:
- Karan & Du showed what distribution we should sample from
- We show **how** to approximate it efficiently, without iterative MCMC
This transforms power sampling from a slow, trajectory-level inference procedure into a fast, training-free, verifier-free decoding algorithm.
Once this decomposition is made explicit, the path to scalable power sampling becomes clear.
From insight to algorithm: scalable power sampling
Once the structure of power sampling is made explicit, the algorithm almost writes itself.
At each generation step, instead of asking only “Which token is most likely?”, we ask a slightly richer question:
“Which token is likely to lead to good futures?”
Concretely, the procedure works as follows:
- We first identify a small set of promising candidate tokens using the base model’s logits.
- For each candidate, we perform a short Monte Carlo lookahead: sampling a handful of possible continuations.
- These rollouts are used to estimate how “good” the future looks if we commit to that token.
- We then rescale the token probabilities using this future-aware correction and sample autoregressively.
No training. No reward model. No verifier.
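Below is a minimal sketch of one decoding step along these lines. The `model` interface (`next_token_logprobs`, `rollout_logprob`), the hyperparameters, and the estimator for the future term are illustrative assumptions on my part rather than the paper's exact implementation; the future term follows the decomposition sketched earlier.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def power_sampling_step(model, prefix, alpha=2.0, top_k=8,
                        num_rollouts=4, rollout_len=32):
    """One autoregressive step of (approximate) power sampling.

    `model` is a hypothetical interface with:
      - next_token_logprobs(prefix) -> array of log p(token | prefix)
      - rollout_logprob(prefix, max_len) -> total log-prob of a continuation
        sampled from the base model
    Illustrative sketch, not the paper's implementation.
    """
    logp = model.next_token_logprobs(prefix)       # log p(y_t | y_<t)
    candidates = np.argsort(logp)[-top_k:]         # keep the top-k tokens

    scores = []
    for tok in candidates:
        # Estimate log V_alpha(prefix + tok) = log E_{y> ~ p}[ p(y>)^(alpha-1) ]
        # via a handful of short rollouts sampled from the base model.
        cont_logps = [model.rollout_logprob(prefix + [int(tok)], rollout_len)
                      for _ in range(num_rollouts)]
        log_v = logsumexp((alpha - 1.0) * np.asarray(cont_logps)) - np.log(num_rollouts)
        # Combine the local power-scaled likelihood with the future-aware correction.
        scores.append(alpha * logp[tok] + log_v)

    probs = softmax(np.asarray(scores))            # normalise over candidates only
    return int(np.random.choice(candidates, p=probs))
```

Because candidates are restricted to the top of the base model's logits, the overhead per step is roughly top_k × num_rollouts short rollouts, which is what keeps the procedure compatible with ordinary autoregressive decoding.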
One subtle challenge remains: estimating ratios of expectations introduces bias.
To address this, we use a simple jackknife correction, a classical statistical technique, which cancels the leading-order bias while keeping computation lightweight.
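For concreteness, here is the textbook jackknife construction applied to a generic ratio of sample means; it illustrates the statistical idea rather than reproducing the paper's exact estimator.

```python
import numpy as np

def jackknife_ratio(numerators, denominators):
    """Jackknife bias-corrected estimate of E[X] / E[Y] from paired samples.

    Generic textbook construction (illustration only): the naive plug-in
    ratio of sample means has O(1/n) bias; the jackknife combination
    cancels that leading-order term using only the same n samples.
    """
    x = np.asarray(numerators, dtype=float)
    y = np.asarray(denominators, dtype=float)
    n = len(x)
    naive = x.mean() / y.mean()
    # Leave-one-out ratios: drop sample i, recompute the plug-in estimate.
    loo = np.array([(x.sum() - x[i]) / (y.sum() - y[i]) for i in range(n)])
    return n * naive - (n - 1) * loo.mean()
```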
The result is a decoding algorithm that:
- operates fully autoregressively
- remains compatible with standard LLM inference stacks
- and scales predictably with compute
Most importantly, it preserves the spirit of power sampling — future-aware reasoning — without the cost of MCMC.
What happens in practice?
The short answer: it works.
Across mathematics, code generation, and knowledge-intensive QA tasks, our method consistently improves over standard decoding and low-temperature sampling.
More strikingly:
- It matches or surpasses GRPO-style post-training on several benchmarks
- It matches or exceeds MCMC-based power sampling
- And it does so with over an order of magnitude lower inference latency
This last point matters.
In the slowest regimes, MCMC-based power sampling can take minutes per prompt.
Our autoregressive approximation reduces this to seconds, without sacrificing reasoning quality.
The takeaway isn’t just that the method is faster.
It’s that reasoning gains once thought to require training can be recovered purely at inference time.
Why this matters beyond this paper
Zooming out, this work supports a broader shift in how we think about intelligence in LLMs.
If reasoning behaviours are already latent in base models, then:
- Post-training is not the only path to intelligence
- Inference-time decisions matter far more than we’ve acknowledged
- And sampling strategies are not a detail — they are a core part of the model
This reframes several debates at once:
- Why RL gains sometimes fail to transfer
- Why diversity collapses under heavy post-training
- Why base models occasionally “surprise” us with correct reasoning
It also has practical consequences.
Inference-time methods:
- are cheaper than training
- are easier to deploy
- and democratise access to strong reasoning without massive compute budgets
At the same time, this perspective comes with responsibility. Distribution sharpening amplifies whatever is already present in the model — good or bad. Inference-time intelligence must therefore be paired with strong base-model alignment.
A closing thought
For years, we’ve treated training as the place where intelligence lives — and inference as a mechanical afterthought.
This work suggests the opposite may be closer to the truth.
The model may already know how to reason. The real question is whether we know how to ask it.
If that’s the case, then the next frontier of AI progress won’t just be about better objectives or bigger datasets.
It will be about learning how to listen more carefully to the models we already have.
This post is based on our recent paper on scalable power sampling for LLM reasoning. If you’re interested in the theory, proofs, and full experimental results, the paper is available on arXiv: https://www.arxiv.org/pdf/2601.21590