Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation
📈 Search Quality | Content type: Academic
Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies th...