reinforcement learning from human feedback, preference learning, reward modeling, DPO