Gram: Assessing sabotage propensities via automated alignment auditing

Covered by 3 sources including The Decoder, lesswrong.com

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models resulting in both excessive role-playing and goal-seeking behavior. In contras...

Covered in 4 articles

The Decoder · Google Deepmind treats its own AI agents like rogue employees with office keys

lesswrong.com · Untitled Draft

lesswrong.com · Testing Gemini models for scheming tendencies