Reinforcement learning towards broadly and persistently beneficial models
Training that targets beneficial behavior in realistic scenarios produces broad improvements in alignment that generalize across domains and persist under adversarial pressure.