RL for Reasoning by Adaptively Revealing Rationales
machinelearning.apple.com·1w

Authors: Mohammad Hossein Amani†, Aryo Lotfi†, Nicolas Mario Baldwin†, Samy Bengio, Mehrdad Farajtabar, Emmanuel Abbé*, Robert West*†

We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal.
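To make the per-sample curriculum concrete, here is a minimal sketch of the prefix-revelation idea, assuming a simple threshold-on-reward update rule. The class name `PrefixCurriculum`, the step size, and the reward threshold are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Hypothetical sketch of per-sample adaptive prefix revelation in the spirit of
# AdaBack: each sample starts with its full expert rationale revealed, and the
# revealed fraction shrinks or grows depending on the reward the model obtained.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PrefixCurriculum:
    """Tracks, per training sample, how much of the expert rationale is revealed."""
    step: float = 0.1  # assumed fraction by which the revealed prefix shrinks or grows
    ratios: Dict[int, float] = field(default_factory=dict)  # sample id -> revealed fraction

    def revealed_prefix(self, sample_id: int, rationale: List[str]) -> List[str]:
        # Begin by revealing the full rationale, then back off as the model improves.
        ratio = self.ratios.setdefault(sample_id, 1.0)
        cutoff = int(round(ratio * len(rationale)))
        return rationale[:cutoff]

    def update(self, sample_id: int, reward: float, threshold: float = 0.5) -> None:
        # If the model earned reward with the current prefix, reveal less next time
        # (harder); otherwise reveal more (easier). Clamp the fraction to [0, 1].
        ratio = self.ratios.get(sample_id, 1.0)
        delta = -self.step if reward >= threshold else self.step
        self.ratios[sample_id] = min(1.0, max(0.0, ratio + delta))


# Toy usage: one sample whose rationale has four steps.
curriculum = PrefixCurriculum()
rationale = ["step 1", "step 2", "step 3", "step 4"]
prefix = curriculum.revealed_prefix(sample_id=0, rationale=rationale)
# ... condition the policy on `prefix`, sample a completion, compute a scalar reward ...
curriculum.update(sample_id=0, reward=1.0)  # success, so reveal a shorter prefix next time
```

The per-sample ratios are what make this a curriculum rather than a fixed schedule: easy samples quickly lose their supervision, while hard ones keep more of the expert prefix until the model can complete them unaided.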
