The corrigibility basin of attraction is a misleading gloss
lesswrong.com·1d

Published on December 6, 2025 3:38 PM GMT

The idea of a “basin of attraction around corrigibility” motivates much of prosaic alignment research. Essentially, it is an abstract way of thinking about the process of iterating on AGI designs: engineers test to find problems, then understand those problems, then design fixes. Corrigibility is needed for this because a non-corrigible agent generally has incentives to interfere with that process. The concept was introduced by Paul Christiano:

… a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only appr…
