Automated alignment is harder than you think

Covered by 5 sources including lesswrong.com, thezvi.substack.com

A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments, resulting in the unintentional deployment of misaligned AI. This could happen because alignment...

Covered in 8 articles

lesswrong.com·

AI #172: The First Fable

lesswrong.com·

Sequent: scale and automation for higher confidence in alignment

lesswrong.com·

Which technical AI safety fields are going to be automated first?
