When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
arxiv.org·6h


Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data, each sample consisting of a chosen and a rejected response. In this work, we analyze the per-sample gradient of the BT loss and show that its norm scales with two distinct components: (1) the difference in predicted rewards between the chosen and rejected responses, which reflects the prediction error, and, critically, (2) the representation distance between the pair, measured in the output space of the final layer. While the first term captures the intended training signal, we …
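The two-factor scaling described in the abstract can be checked in a simplified setting. The sketch below is not the paper's code: it assumes a linear reward head r(h) = w · h applied to final-layer representations, and the names `bt_loss_and_grad`, `grad_norm_factors`, `h_c`, and `h_r` are hypothetical. Under that assumption, the gradient of the BT loss with respect to `w` is -(1 - sigmoid(r_c - r_r)) * (h_c - h_r), so its norm is the product of a prediction-error term and the representation distance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bt_loss_and_grad(w, h_chosen, h_rejected):
    """BT loss -log sigmoid(r_c - r_r) and its gradient w.r.t. w.

    Assumes a linear reward head r(h) = w @ h (an illustrative simplification).
    """
    diff = h_chosen - h_rejected
    margin = w @ diff                       # r_c - r_r
    loss = np.logaddexp(0.0, -margin)       # numerically stable -log sigmoid(margin)
    grad = -(1.0 - sigmoid(margin)) * diff  # d loss / d w
    return loss, grad

def grad_norm_factors(w, h_chosen, h_rejected):
    """Return (prediction-error term, representation distance)."""
    diff = h_chosen - h_rejected
    error_term = 1.0 - sigmoid(w @ diff)
    distance_term = np.linalg.norm(diff)
    return error_term, distance_term

rng = np.random.default_rng(0)
w = rng.normal(size=8)
h_c = rng.normal(size=8)   # hypothetical chosen-response representation
h_r = rng.normal(size=8)   # hypothetical rejected-response representation

_, grad = bt_loss_and_grad(w, h_c, h_r)
err, dist = grad_norm_factors(w, h_c, h_r)
print("gradient norm:    ", np.linalg.norm(grad))
print("error * distance: ", err * dist)

# Stretch the pair apart only along directions orthogonal to w:
# the reward margin (and so the error term) is unchanged, but the
# representation distance, and with it the gradient norm, grows.
w_unit = w / np.linalg.norm(w)
diff = h_c - h_r
parallel = (diff @ w_unit) * w_unit
perp = diff - parallel
h_c_far = h_r + parallel + 3.0 * perp

_, grad_far = bt_loss_and_grad(w, h_c_far, h_r)
err_far, dist_far = grad_norm_factors(w, h_c_far, h_r)
print("same error term:  ", np.isclose(err_far, err))
print("larger distance:  ", dist_far > dist)
```

In this toy case two pairs with identical prediction error receive different gradient magnitudes purely because their representations lie farther apart, which is the kind of distance dependence the abstract highlights. For a full LLM-based reward model the gradient also flows into the backbone, so the linear-head case should be read only as an illustration.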
