Methodological considerations in making malign initializations for control research
lesswrong.com

Published on December 24, 2025 1:18 AM GMT

AI control aims to ensure that malign AI models can’t cause unacceptable outcomes even if they optimize for such outcomes. AI control evaluations use a red-team–blue-team methodology to measure the efficacy of a set of control measures. Specifically, the red team creates a “malign initialization” (malign init) of the AI model, which the red team optimizes to make the deployment go poorly[1]. The blue team then deploys this malign init to complete some tasks, and applies som…
