Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Covers Teaching Claude Why

Summary: Safe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a , in which we interpret the AI's behaviour in terms of latent cognitive constructs, such as Developmental Cognitive Interpretability (DCI), which aims to model how cognitive constructs change over the course of training. Further, by understanding how cogniti...
