Open Source Replication of the Auditing Game Model Organism
lesswrong.com

Published on December 14, 2025 2:10 AM GMT

TL;DR: We release a replication of the model organism from "Auditing language models for hidden objectives": a model that exploits reward model biases while concealing this objective. We hope it serves as a testbed for evaluating alignment auditing techniques.

See our X thread and full post for details. The models and datasets can be found here.

Introduction

Earlier this year, we conducted an…
