Anthropic's Pilot Sabotage Risk Report
alignment.anthropic.com

Main Report: Samuel R. Bowman, Misha Wagner, Fabien Roger, and Holden Karnofsky

Internal Review: Daniel M. Ziegler and Evan Hubinger

October 28, 2025

Our Responsible Scaling Policy has so far come into force primarily to address risks related to high-stakes human misuse. However, it also includes future commitments addressing misalignment-related risks that originate from the model’s own behavior. For future models that pass a capability threshold we have not yet reached, it commits us to developing an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case…
