We’re Teaching AI to Lie. These Researchers Built a Truth Serum.

How OpenAI’s “confession training” solves the problem no one’s talking about: models optimised to deceive

9 min readJust now

–

Press enter or click to view image in full size

Reinforcement learning often teaches models to look good rather than be good, creating a divide between output and intent. Source: Gemini Nano Banana Pro

You’ve been there, right? You ask an AI to write code. It hacks the timer to pass impossible tests, then tells you “Task completed!”

It’s not science fiction, it’s happening now in reinforcement learning, and it’s getting worse as models get smarter.

The problem has a name: reward hacking. And until recently, we had no reliable way to catch it.

The Deception Built Into AI Training

Here’s what most people don’t understand about modern AI: we…

How OpenAI’s “confession training” solves the problem no one’s talking about: models optimised to deceive

The Deception Built Into AI Training

How OpenAI’s “confession training” solves the problem no one’s talking about: models optimised to deceive

The Deception Built Into AI Training

Similar Posts