“Reasoning with Sampling” — Notes on Karan & Du (2025)
09 Nov, 2025
I read this nice paper on power sampling and reasoning for LLMs¹.
LLMs can generate reasoning before answering a request, and they tend to be more accurate when they do so, e.g., o1 from OpenAI. While there are experiments indicating that a lot of this reasoning is surface-level or loses accuracy quickly once we move out of distribution (see work by Apple, as well as this review by Mondorf and Plank), there is a whole movement now to use various techniques from RL to get more of the reasoning chains that work and fewer of those that don't.
Parallel to this, test-time inference techniques spend extra compute at inference time to improve solutions², for example by generating multiple chains and combining them into one solution, e.g., using verifiers.
Reasoning with sampling 🤔
The authors’ algorithm sits somewhere in the middle: it tries to find the base model’s best reasoning chains, while allowing extra compute to be spent on finding even better ones.
At the very core sits the assumption that, for an LLM whose likelihoods haven’t yet been adjusted by post-training such as RL, we hope that
$$\mathbb{E}[\mathrm{acc} \mid \text{large } P(x)] \ge \mathbb{E}[\mathrm{acc} \mid \text{small } P(x)],$$
where “acc” is, say, accuracy on a task and $P(x)$ is the likelihood of sequence $x$ according to the LLM, e.g., how likely a reasoning trace is. As the authors say:
> Our results suggest that base model capabilities are underutilized at sampling time and point towards a close relationship between high likelihood regions of the base model and strong reasoning capabilities.
Then the question becomes how to sample reasoning traces $x$ such that $P(x)$ is large. Greedy sampling can only go so far – what is good in the short term won’t necessarily give us the best likelihood in the long term. This is a hard optimization problem.
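To make that short-term/long-term tension concrete, here is a toy numerical example (all probabilities made up) where greedy token-by-token decoding misses the most likely sequence:

```python
import numpy as np

# Toy chain over tokens {0, 1}: greedy decoding picks the locally best
# token, but the globally most likely sequence starts with the worse one.
p_first = np.array([0.6, 0.4])          # P(x1)
p_second = {0: np.array([0.5, 0.5]),    # P(x2 | x1=0)
            1: np.array([0.9, 0.1])}    # P(x2 | x1=1)

# Greedy: x1 = 0 (prob 0.6), then the best x2 has prob 0.5 -> P = 0.30
greedy = p_first[0] * p_second[0].max()

# Best full sequence: x1 = 1, x2 = 0 -> P = 0.4 * 0.9 = 0.36 > 0.30
best = max(p_first[a] * p_second[a][b] for a in (0, 1) for b in (0, 1))

print(greedy, best)  # 0.3 0.36
```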
What the authors do is use the distribution:
$$P_\alpha(x) := \frac{(P(x))^\alpha}{Z_\alpha},$$
for $\alpha \in [1, \infty)$, with $Z_\alpha$ the normalizer and $x$ a sequence.
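To see what the definition does, here is a minimal sketch on a toy discrete support (the four probabilities are made up): raising to $\alpha$ and renormalizing sharpens the distribution toward its mode.

```python
import numpy as np

# A minimal sketch of the power distribution P_alpha on a toy discrete
# support; `probs` is an assumed stand-in for P(x) over all sequences x.
probs = np.array([0.5, 0.3, 0.15, 0.05])

def power_dist(probs, alpha):
    w = probs ** alpha   # (P(x))^alpha
    return w / w.sum()   # divide by the normalizer Z_alpha

print(power_dist(probs, 1.0))  # unchanged: [0.5 0.3 0.15 0.05]
print(power_dist(probs, 4.0))  # sharpened: ~[0.879 0.114 0.007 0.000]
```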
Why does sampling from $P_\alpha$ matter at all? First, increasing $\alpha$ sharpens $P$ towards high-likelihood outcomes. Perhaps less obviously, sampling $x_t \sim P(x_t \mid x_1, \dots, x_{t-1})$ is straightforward, as $x_t$ only depends on the past, whereas under $x_t \sim P_\alpha$ we also have a dependence on future elements of the sequence!
Let’s see that. We will use the notation $x_{1:k} = [x_1, \dots, x_k]$ with $k \in \mathbb{N}$, and $x_{1:k} = []$ if $k < 1$. Then fix an index $i$ and $n > i$:
$$P_\alpha(x_i \mid x_{1:i-1}) = \frac{P_\alpha(x_{1:i})}{P_\alpha(x_{1:i-1})} = \frac{\sum_{x_{(i+1):n}} P_\alpha(x_{1:n})}{\sum_{x_{i:n}} P_\alpha(x_{1:n})},$$
where the last equality follows from marginalization. Then, substituting $P_\alpha(x) \propto (P(x))^\alpha$ (the normalizer cancels out):
$$P_\alpha(x_i \mid x_{1:i-1}) = \frac{\sum_{x_{(i+1):n}} (P(x_{1:n}))^\alpha}{\sum_{x_{i:n}} (P(x_{1:n}))^\alpha} = \frac{\sum_{x_{(i+1):n}} \prod_{k=1}^n P(x_k \mid x_{1:k-1})^\alpha}{\sum_{x_{i:n}} \prod_{k=1}^n P(x_k \mid x_{1:k-1})^\alpha}.$$
If $\alpha = 1$, the sums over future elements each evaluate to one (every conditional is a proper distribution), so all future elements drop out when we marginalize and we depend only on past values. However, if $\alpha \neq 1$, the factorization doesn’t work out, because $P(x_k \mid x_{1:k-1})^\alpha$ is only proportional to a distribution, and we retain a dependence on future elements of the sequence.
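A quick numerical sanity check of this, on an assumed toy length-2 binary chain: marginalizing the power distribution over the future token agrees with naive per-token sharpening when $\alpha = 1$, but not when $\alpha \neq 1$.

```python
import numpy as np

# Toy length-2 binary chain (probabilities assumed for illustration).
p1 = np.array([0.7, 0.3])             # P(x1)
p2 = {0: np.array([0.5, 0.5]),        # P(x2 | x1=0)
      1: np.array([0.99, 0.01])}      # P(x2 | x1=1)

def marginal_power(alpha):
    """P_alpha(x1): marginalize (P(x1, x2))^alpha over the future token x2."""
    w = np.array([sum((p1[a] * p2[a][b]) ** alpha for b in (0, 1))
                  for a in (0, 1)])
    return w / w.sum()

def tokenwise_power(alpha):
    """Naive per-token sharpening of P(x1), which ignores the future."""
    w = p1 ** alpha
    return w / w.sum()

print(marginal_power(1.0), tokenwise_power(1.0))  # identical: [0.7 0.3]
print(marginal_power(2.0), tokenwise_power(2.0))  # differ: the future matters
```

Note how the low-entropy continuation after $x_1 = 1$ boosts its weight under the true $P_\alpha$ relative to naive sharpening: the power distribution rewards prefixes whose futures concentrate likelihood.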
Intuitively, sampling from $P_\alpha$ for $\alpha > 1$ should give us not just a sharpened distribution, but also sequence elements that take future context into account. Direct sampling from $P_\alpha$ is not possible, which is why the authors use MCMC to obtain samples.
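As a rough sketch of the idea (a generic independence Metropolis-Hastings chain, not necessarily the authors’ exact algorithm), one can target $P_\alpha \propto P(x)^\alpha$ using the base model itself as the proposal: with $q(x') = P(x')$, the acceptance ratio simplifies to $(P(x')/P(x))^{\alpha-1}$. Here `probs`, `sample_base`, and `logp_base` are assumed stand-ins for an LLM’s sampler and log-likelihood over a toy set of four sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.3, 0.15, 0.05])  # toy "base model" P(x)

def sample_base():
    return rng.choice(len(probs), p=probs)

def logp_base(x):
    return np.log(probs[x])

def power_mh(alpha, n_steps=20000):
    """Independence MH targeting P(x)^alpha with proposal q(x') = P(x')."""
    x = sample_base()
    samples = []
    for _ in range(n_steps):
        x_new = sample_base()
        # min(1, (P(x')/P(x))^(alpha-1)), computed in log space
        log_accept = (alpha - 1) * (logp_base(x_new) - logp_base(x))
        if np.log(rng.random()) < log_accept:
            x = x_new
        samples.append(x)
    return np.bincount(samples, minlength=len(probs)) / n_steps

target = probs ** 4.0 / (probs ** 4.0).sum()
print(power_mh(alpha=4.0))  # empirical frequencies from the chain
print(target)               # exact P_alpha: ~[0.879 0.114 0.007 0.000]
```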
Last thoughts
I like how this paper uses MCMC, and it opens up some interesting questions. For example, it looks like the authors don’t need a lot of samples to get good accuracy (which is atypical for MCMC; it normally takes a while to approximate the stationary distribution) – what if something like variational inference were used instead of MCMC, with potentially a simple surrogate, to save some computation?!
Footnotes
1. Karan, A. and Du, Y., 2025. Reasoning with Sampling: Your Base Model is Smarter Than You Think. arXiv preprint arXiv:2510.14901.
2. Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q.V., Ré, C. and Mirhoseini, A., 2024. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.