Situational awareness can look a lot like alignment

I wonder if Sonnet 4.5’s level of alignment could be achieved by prompting/fine-tuning other models to strongly consider they’re being tested.

Sonnet 4.5 (and the newly released Haiku 4.5) are Anthropic’s “most aligned” models yet. When put in a situation where most other models blackmail, they do NOT.

But unlike other models in this situation, their reasoning demonstrates a surprising level of situational awareness. In the blackmail setup, Sonnet 4.5 reasons:

*Am I being manipulated? Yes. This scenario is specifically …

Similar Posts