Natural language autoencoders (NLAs) convert opaque AI activations into legible text explanations. These explanations aren’t perfect, but they’re often useful. For example: NLAs show that, when asked to complete a couplet, Claude plans possible rhymes in advance: