Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘’, unifying insights from

The case for building more natural model organisms for alignment research

Model organisms are how we study alignment-relevant pathologies (such as secret loyalties, reward hacking, and sandbagging), and they serve as a testbed for alignment auditing and interpretability methods. Their usefulness therefore depends on whether they remain a realistic proxy for the systems we care ...
