Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback

View PDF HTML (experimental)

Abstract:Aligning the behavior of Large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. It means RM fails to provide feedback that accurately aligns with human preference, causing LLMs to explore unexpected generalizations, and failing to achieve alignment objectives. To mitigate this issue, we propose a novel \textit{sequence-to-sequence (seq2seq) reward modeling} method. Its key ins…

View PDF HTML (experimental)

Abstract:Aligning the behavior of Large language models (LLMs) with human intentions and values remains a critical challenge. Reinforcement learning from human feedback (RLHF) aligns LLMs by training a reward model (RM) on human preferences and fine-tuning the LLMs to maximize RM feedback. Despite its effectiveness and popularity, RLHF is prone to biased local optimization. It means RM fails to provide feedback that accurately aligns with human preference, causing LLMs to explore unexpected generalizations, and failing to achieve alignment objectives. To mitigate this issue, we propose a novel \textit{sequence-to-sequence (seq2seq) reward modeling} method. Its key insight is that learning from language feedback rather than scalar feedback improves RLHF without additional annotations. We replaced the reward modeling target from binary maximum likelihood estimation (MLE) with sequence MLE. This method enables richer and fine-grained language feedback without additional annotations, models, or training stages. Our experiments demonstrated its effectiveness, specifically, reducing the refusal-to-response paradigm in single-turn safety dialogues and the long-response bias in text summarization tasks. We provide further analysis that seq2seq RM improves RLHF performance across 2B and 7B LLMs on 3 NLP tasks, achieving an average win rate of 76.9%. We further show that seq2seq RM can still improve the performance of RLHF under out-of-distribution prompts.


Comments:	7 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2409.00162 [cs.CL]
	(or arXiv:2409.00162v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.00162 arXiv-issued DOI via DataCite

Submission history

From: Jiayi Zhou [view email] [v1] Fri, 30 Aug 2024 16:14:35 UTC (2,000 KB) [v2] Wed, 24 Dec 2025 11:32:24 UTC (1,998 KB)

Submission history

Similar Posts