When does Claude sabotage code? An Agentic Misalignment follow-up
lesswrong.com·9h

Published on November 9, 2025 11:11 PM GMT

This is a research note presenting a portion of the research I did during MATS 8.0 under the supervision of Girish Sastry and Steven Adler.

 

TL;DR: I tested whether Claude Sonnet 4 would sabotage code when given goal conflicts (similar to the Agentic Misalignment paper), and whether it generates worse code for companies it ‘dislikes.’ In an Agentic Misalignment (AM)-style scenario, Claude frequently sabotaged code, but in more realistic settings with subtle company information, there was no correlation between company preference and code quality.


Poster available here.
 

Motivation

In June, the <a href=“https://www.anthropic.com/research/agentic-mis…
