The safe-to-dangerous shift is a fundamental problem for eval realism, but also for measuring awareness

Covers 2 stories including Alignment faking in large language models

1) The safe-to-dangerous shift is a fundamental problem for eval realism

Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherw...
