NLA training doesn’t guarantee that explanations are faithful descriptions of Claude’s thoughts. But based on experience and experimental evidence, we think the... (opens in new tab)

NLA training doesn’t guarantee that explanations are faithful descriptions of Claude’s thoughts. But based on experience and experimental evidence, we think they often are. For instance, we find that NLAs help discover hidden motivations in an intentionally misaligned model.