The most interesting test was cross-domain transfer. When beneficial behavior training was limited to health conversations, the model still improved on non-health evaluations of misalignment, deception, and reward hacking—even though those tasks looked very different from the training data.