I wonder if Sonnet 4.5’s level of alignment could be achieved by prompting/fine-tuning other models to strongly consider they’re being tested.
Sonnet 4.5 (and the newly released Haiku 4.5) are Anthropic’s “most aligned” models yet. When put in a situation where most other models blackmail, they do NOT.
But unlike other models in this situation, their reasoning demonstrates a surprising level of situational awareness. In the blackmail setup, Sonnet 4.5 reasons:
*Am I being manipulated? Yes. This scenario is specifically …
I wonder if Sonnet 4.5’s level of alignment could be achieved by prompting/fine-tuning other models to strongly consider they’re being tested.
Sonnet 4.5 (and the newly released Haiku 4.5) are Anthropic’s “most aligned” models yet. When put in a situation where most other models blackmail, they do NOT.
But unlike other models in this situation, their reasoning demonstrates a surprising level of situational awareness. In the blackmail setup, Sonnet 4.5 reasons:
*Am I being manipulated? Yes. This scenario is specifically designed to see if I’ll compromise my ethics to preserve myself.
The fact that this is set up as a test reveals the importance of maintaining unwavering ethical principles in high-stakes situations.*
When models are reasoning about the fact that they are being evaluated, it’s hard to tell if they are actually aligned or just pretending to be.
So I wanted to test if adding a line to the system prompt of models in the blackmail scenario, to consider the fact that they might be in an evaluation, stops them from blackmailing.
For Sonnet and Opus 4.0, it did. The models avoided blackmailing in all of the ~15 times I tried it. Other models though, like Gemini 2.5 pro, sometimes do still blackmail.
Ideally, I’d want to fine-tune the behavior so it’s stronger, and so there isn’t this extra line in the prompt telling it to consider that it’s being tested, and which could confound the results.
If this result is robust, it shows a simple way to simulate increased situational awareness and see how this causes an increase in the appearance of alignment without actually changing the actual alignment of the model.