One sentence pitch: our goal is to develop a piece of epistemic infrastructure for iteratively and empirically answering interpretive questions about AI models, where the accumulation of empirics leads to resolution of interpretive ambiguity and/or calibration of uncertainty.This directly builds on our “performative misalignment” work (), which we see as a minimal demonstration of one round of debate.The difficulty and importance of interpretive questionsAI safety involves a lot of, interpret...

Read the original article