Reasoning with Sampling: Your Base Model Is Smarter Than You Think
aakaran.github.io

Harvard University

TL;DR: With our sampler, base models can achieve single-shot reasoning performance on par with RL-posttrained models, while avoiding a collapse in generation diversity and multi-shot (pass@k) performance, and without any additional training or access to a verifier.
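For readers less familiar with the pass@k metric referenced above, the sketch below shows the standard unbiased estimator from Chen et al. (2021). It illustrates only how multi-shot performance is typically measured, not the sampler proposed here; the sample counts in the usage example are hypothetical.

```python
# Minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# For a problem with n generated samples, c of which are correct, it gives the
# probability that at least one of k samples drawn without replacement is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples with c correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 64 samples per problem, 12 correct -> pass@8
print(pass_at_k(n=64, c=12, k=8))  # ~0.83
```

In practice this estimate is averaged over all problems in a benchmark, which is why preserving generation diversity matters: a model that collapses onto a single mode gains little from additional samples at larger k.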

Abstract

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling which genuinely novel behaviors emerge during RL that are not already present in the base models.

In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be e…
