RLHF, reinforcement learning from human feedback, reward model, alignment