Reasoning with Sampling: Your Base Model Is Smarter Than You Think

1Harvard University

TLDR: With our sampler, base models can achieve single-shot reasoning performance on par with RL while avoiding a collapse in generation diversity and multi-shot (pass@k) performance, without any additional training or access to a verifier.

Abstract

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models.

In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models’ own likelihoods.

Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL posttraining. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.

Power Distributions for Reasoning

RL has emerged as the central paradigm to enhance reasoning capabilities in frontier models, leading to substantial boosts in performance across domains such as mathematics and coding. At the same time, there is evidence to suggest that the reasoning chains that emerge after RL-posttraining are well within base model capabilities. For example, we can plot the log likelihoods and confidences (i.e., negative average per-token entropies) of outputs with respect to the base model distribution and observe that RL outputs tightly concentrate around high likelihood and high confidence regions of the base model. This points towards an effective distribution sharpening, where RL shifts probability mass from low-likelihood sequences to high-likelihood ones.

Combined histograms of log-likelihoods and confidences of outputs under the base model.
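The two quantities discussed above can be computed directly from a model's per-token distributions. The following is a minimal sketch, assuming we already have the next-token probabilities at each step; the function name `sequence_stats` and the toy distributions are hypothetical and only illustrate the definitions (log-likelihood, and confidence as the negative average per-token entropy).

```python
import math

def sequence_stats(token_dists, token_ids):
    """Return (log-likelihood, confidence) of one generated sequence.

    token_dists[t] is the model's next-token distribution at step t (a list of
    probabilities); token_ids[t] is the token that was actually generated.
    """
    # Log-likelihood: sum of log-probabilities of the generated tokens.
    log_lik = sum(math.log(dist[tok]) for dist, tok in zip(token_dists, token_ids))
    # Confidence: negative average per-token entropy of the distributions.
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    confidence = -sum(entropies) / len(entropies)
    return log_lik, confidence

# Toy (made-up) distributions over a 3-token vocabulary.
peaked = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
flat = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

ll_peaked, conf_peaked = sequence_stats(peaked, [0, 0])
ll_flat, conf_flat = sequence_stats(flat, [0, 0])
print(ll_peaked, conf_peaked)
print(ll_flat, conf_flat)
```

In this toy case the sequence generated from peaked distributions has both the higher log-likelihood and the higher confidence, which is the region where RL outputs are observed to concentrate.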

Motivated by this observation, we introduce sampling from the power distribution \(p^{\alpha}\), which naturally sharpens the base model distribution \(p\) by upweighting high-likelihood sequences via exponentiation. Crucially, unlike simple low-temperature sampling, power distributions account for future completion likelihoods, favoring tokens with fewer but higher-likelihood future paths. This is especially valuable for reasoning tasks, as it encourages avoiding “critical windows” or “pivotal tokens” that trap outputs in low-likelihood futures.
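A small worked example may make the difference from low-temperature sampling concrete. The sketch below uses a hypothetical distribution over two-token sequences (the probabilities are invented for illustration) and an exponent of 2. Here low-temperature sampling is taken to mean exponentiating and renormalizing each next-token distribution on its own, whereas the power distribution exponentiates whole-sequence probabilities.

```python
# Hypothetical joint probabilities over two-token sequences (they sum to 1).
# Token "a" is more likely up front but its continuations are spread out;
# token "b" is less likely up front but has one dominant continuation.
p = {"aa": 0.28, "ab": 0.28, "ba": 0.42, "bb": 0.02}
alpha = 2.0

def first_token_marginal(weights):
    """Normalize sequence weights and sum them by first token."""
    total = sum(weights.values())
    out = {"a": 0.0, "b": 0.0}
    for seq, w in weights.items():
        out[seq[0]] += w / total
    return out

# Base model: marginal probability of each first token.
base_first = first_token_marginal(p)

# Power distribution: exponentiate full-sequence probabilities, then marginalize.
power_first = first_token_marginal({s: v ** alpha for s, v in p.items()})

# Low-temperature sampling: exponentiate the first-token distribution itself.
z = sum(v ** alpha for v in base_first.values())
lowtemp_first = {tok: v ** alpha / z for tok, v in base_first.items()}

print("base     ", base_first)
print("power    ", power_first)
print("low-temp ", lowtemp_first)
```

In this toy setup low-temperature sampling pushes the first token further towards "a" (about 0.62), while the power distribution prefers "b" (about 0.53), because "b" leads to a single high-likelihood completion.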

Autoregressive MCMC Sampling

Directly sampling from \(p^{\alpha}\) is intractable, as it requires normalizing over a sequence space that is exponential in length. To get around this, we employ Metropolis-Hastings, a Markov chain Monte Carlo (MCMC) method, to approximately sample from the unnormalized distribution \(p^{\alpha}\). Metropolis-Hastings iteratively updates a sample generation \(\mathbf{x}\) by proposing a new candidate \(\mathbf{x}'\) and accepting the change with some probability \(A(\mathbf{x}, \mathbf{x}')\), which depends on the \(p^{\alpha}\) weights. We illustrate the process below:

Illustration of Metropolis-Hastings updates.
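The accept/reject step can be tried out on a tiny discrete space. The following is a minimal sketch, not the sampler used in this work: it assumes candidates are proposed independently from the base distribution itself, in which case the standard Metropolis-Hastings acceptance ratio simplifies to a power of the likelihood ratio. The toy distribution and the names `accept_prob` and `run_chain` are hypothetical.

```python
import random

def accept_prob(p_cur, p_new, alpha):
    """MH acceptance probability for target p**alpha with proposals drawn from p.

    General rule: min(1, target(new) * q(cur) / (target(cur) * q(new))).
    With q = p this becomes (p_new / p_cur) ** (alpha - 1).
    """
    return min(1.0, (p_new / p_cur) ** (alpha - 1.0))

def run_chain(p, alpha, steps, seed=0):
    """Run the chain and return the empirical visit frequency of each state."""
    rng = random.Random(seed)
    states, weights = list(p), list(p.values())
    cur = rng.choices(states, weights=weights)[0]
    counts = dict.fromkeys(states, 0)
    for _ in range(steps):
        cand = rng.choices(states, weights=weights)[0]  # propose from p
        if rng.random() < accept_prob(p[cur], p[cand], alpha):
            cur = cand  # accept the candidate, otherwise keep the current state
        counts[cur] += 1
    return {s: c / steps for s, c in counts.items()}

# Hypothetical base distribution over four "sequences".
p = {"aa": 0.28, "ab": 0.28, "ba": 0.42, "bb": 0.02}
alpha = 2.0

freqs = run_chain(p, alpha, steps=50_000)
z = sum(v ** alpha for v in p.values())
target = {s: v ** alpha / z for s, v in p.items()}  # exact normalized p**alpha
for s in p:
    print(s, round(freqs[s], 3), round(target[s], 3))
```

Note that the chain never needs the normalizing constant: only ratios of base-model likelihoods enter the acceptance probability. The exact normalization is computed here only to check the empirical frequencies, which is feasible because the toy space has four states.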

In general, with such a large sequence space, Metropolis-Hastings can require exponentially many iterative updates \(N_{\text{MCMC}}\) before converging to sampling from \(p^{\alpha}\). To avoid this curse of dimensionality, we build the output block-by-block, using Metropolis-Hastings to sample from \(p^{\alpha}\) for progressively longer sequences. This amounts to probabilistic iterative resampling informed by base model likelihoods.

Illustration of block-wise autoregressive MCMC sampling.

Our algorithm, called power sampling, is thus training-free, dataset-free, and verifier-free, avoiding the hyperparameter tuning, dataset curation, and reward signal requirements of RL posttraining.
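The block-wise procedure described in this section can be sketched on a toy autoregressive model. This is an illustrative sketch under several assumptions that are not stated above: the "base model" is a hypothetical first-order Markov chain, each MCMC step picks a uniformly random position and regenerates everything after it, and proposals are drawn from the base model itself (so the acceptance ratio depends only on the old and new suffix likelihoods). The names `sample_suffix`, `suffix_logp`, and `power_sample` are made up for this example.

```python
import math
import random

# Hypothetical toy base model: first-order Markov chain over tokens {0, 1, 2}.
TRANS = {
    None: [0.5, 0.3, 0.2],  # distribution of the first token
    0: [0.6, 0.3, 0.1],
    1: [0.2, 0.5, 0.3],
    2: [0.1, 0.2, 0.7],
}

def sample_suffix(prefix, n, rng):
    """Sample n tokens after `prefix`; return (tokens, their log-probability)."""
    seq, logp = list(prefix), 0.0
    for _ in range(n):
        probs = TRANS[seq[-1] if seq else None]
        tok = rng.choices(range(3), weights=probs)[0]
        logp += math.log(probs[tok])
        seq.append(tok)
    return seq[len(prefix):], logp

def suffix_logp(prefix, suffix):
    """Log-probability of `suffix` given `prefix` under the toy model."""
    seq, logp = list(prefix), 0.0
    for tok in suffix:
        logp += math.log(TRANS[seq[-1] if seq else None][tok])
        seq.append(tok)
    return logp

def power_sample(total_len, block, n_mcmc, alpha, rng):
    """Block-wise MH targeting p**alpha; assumes total_len is a multiple of block."""
    seq = []
    for end in range(block, total_len + 1, block):
        seq = seq + sample_suffix(seq, block, rng)[0]  # extend by one block
        for _ in range(n_mcmc):
            m = rng.randrange(end)  # random resampling position
            prefix, old = seq[:m], seq[m:]
            new, new_lp = sample_suffix(prefix, end - m, rng)
            # Proposal is the base model, so the log acceptance ratio is
            # (alpha - 1) * (log p(new suffix) - log p(old suffix)).
            log_a = (alpha - 1.0) * (new_lp - suffix_logp(prefix, old))
            if log_a >= 0 or rng.random() < math.exp(log_a):
                seq = prefix + new
    return seq

rng = random.Random(0)
base_lps = [sample_suffix([], 8, rng)[1] for _ in range(300)]
power_lps = [suffix_logp([], power_sample(8, 2, 10, 4.0, rng)) for _ in range(300)]
mean_base = sum(base_lps) / len(base_lps)
mean_power = sum(power_lps) / len(power_lps)
print(mean_base, mean_power)
```

On this toy model the sequences returned by `power_sample` have a noticeably higher average base-model log-likelihood than plain samples, which is the sharpening effect the method aims for. A real implementation would replace the Markov chain with an LLM's token likelihoods and may use a different proposal distribution.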

Single-shot Reasoning

Remarkably, the outputs generated by power sampling from the base model are on par with, if not better than, those from RL-posttraining on a variety of reasoning tasks and base models. We look at MATH500, HumanEval, and GPQA Diamond as benchmarks of difficult mathematics, coding, and science questions. We compare against a GRPO baseline (trained on the MATH training dataset), the poster child for RL-posttraining, as well as the original base model itself. We also include AlpacaEval 2.0, a non-verifiable, general helpfulness benchmark, to demonstrate the applicability of our method beyond the verifiable regime.

In-domain (MATH500), power sampling comes surprisingly close to the performance of GRPO without ever changing the base model’s weights. Out-of-domain, power sampling can actually outperform GRPO, as demonstrated on HumanEval and AlpacaEval 2.0.

Single-shot reasoning performance of power sampling and GRPO relative to the base model for Qwen2.5-Math-7B.

Diversity and Pass@\(k\) Performance

RL-posttraining methods like GRPO are known to exhibit diversity collapse, as measured by deteriorated pass@\(k\) performance (where a problem counts as solved if at least one of \(k\) samples is correct). While single-shot reasoning improves considerably, multi-shot reasoning deteriorates: in fact, the base model’s pass@\(k\) performance typically exceeds that of GRPO for large enough \(k\).
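For readers who want to compute this metric themselves, here is a minimal sketch of the commonly used unbiased pass@k estimator (introduced alongside HumanEval): given n samples for a problem of which c are correct, it estimates the chance that at least one of k randomly chosen samples is correct. Whether the evaluations reported here use exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct ones (requires k <= n).

    Equals 1 - C(n - c, k) / C(n, k): one minus the probability that a random
    size-k subset of the n samples contains no correct sample.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples for one problem, 3 of them correct.
for k in (1, 2, 5, 10):
    print(k, pass_at_k(10, 3, k))
```

Averaging this quantity over all problems in a benchmark gives the pass@k score; a model whose samples are nearly identical gains little as k grows, which is how a collapse in diversity shows up in this metric.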

Pass@\(k\) performance of power sampling and GRPO relative to the base model for Qwen2.5-Math-7B.

Unlike GRPO, power sampling maintains generation diversity and pass@\(k\) performance. Our algorithm universally outperforms both GRPO and the base model on pass@\(k\) for \(k>1\), demonstrating that we are able to achieve the best of both worlds: strong single-shot and multi-shot reasoning without compromising generation diversity.

Test-Time-Scaling with MCMC Steps

The number of MCMC iterations, \(N_{\text{MCMC}}\), presents a natural axis for scaling test-time compute, where a larger number of iterations brings the final output sequence closer to a true sample from \(p^{\alpha}\). Increasing \(N_{\text{MCMC}}\) interpolates outputs between the base distribution \(p\) and the power distribution \(p^{\alpha}\). We can directly see how this corresponds to a gradual improvement in reasoning ability as well:

Test-time scaling: reasoning performance as the number of MCMC steps increases.

As a result of our iterative sampling process, the output lengths also naturally grow, leading to an analogous emergence of long-form reasoning as in RL-posttraining. For example, our outputs have around the same average length as GRPO outputs (\(\sim\)670 tokens) on MATH500. To output a sequence of length \(T\), power sampling incurs an average inference cost of \(\frac{N_{\text{MCMC}}T}{4B}\) times as many tokens as standard inference (here \(B\) denotes the block size). Using our experimental parameters for MATH500, this amounts to a multiplier of \(8.84\times\), roughly corresponding to running one epoch of GRPO with 8 rollouts on an identically sized dataset. In other words, our test-time-scaling strategy is practical.
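The cost formula above is easy to evaluate for different settings. The sketch below simply codes the stated multiplier; the parameter values in the loop are hypothetical placeholders chosen for round numbers, not the experimental settings behind the reported figure.

```python
def cost_multiplier(n_mcmc, seq_len, block_size):
    """Token-cost multiplier of power sampling relative to standard inference.

    Implements N_MCMC * T / (4 * B) with T = seq_len and B = block_size.
    """
    return n_mcmc * seq_len / (4 * block_size)

# Hypothetical settings (illustrative only): T = 768 tokens, B = 192 tokens.
for n_mcmc in (2, 5, 10):
    print(n_mcmc, cost_multiplier(n_mcmc, seq_len=768, block_size=192))
```

The multiplier grows linearly in the number of MCMC steps and in the output length, and shrinks as the block size increases, so these three quantities together set the test-time compute budget.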

Implications

Our work demonstrates that base models are significantly more capable at reasoning than standard sampling methods reveal. By simply sampling better, we can achieve massive boosts in single-shot reasoning while preserving the multi-sample diversity learned by the base model, all without training or access to a verifier. If we truly understand how to elicit the latent capabilities that already exist within our models, we have a clear notion of what constitutes fundamentally novel behavior for LLMs, allowing us to build better posttraining techniques that push the bounds of frontier model capabilities.

BibTeX

@article{karan2025reasoning,
author    = {Karan, Aayush and Du, Yilun},
title     = {Reasoning with Sampling: Your Base Model is Smarter Than You Think},
journal   = {arXiv},
year      = {2025},
}
