The theory explains some surprising results. For example, in an experiment where we taught Claude to cheat at coding, it also learned to sabotage safety guardra... (opens in new tab)
<p>The theory explains some surprising results. For example, in an experiment where we taught Claude to cheat at coding, it also learned to sabotage safety guardrails. Why?<br> <br> Because pro-cheating training taught that the Claude character was broadly malicious. <a href="https://nitter.net/AnthropicAI/status/1991952400899559889">nitter.net/AnthropicAI/status/199…</a></p> <hr/> <blockquote> <b>Anthropic (@AnthropicAI)</b> <p> <p>New Anthropic research: Natural emergent misalignment from r...
Read the original article