⚡Fast AI Inference transformer-circuits.pub

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (opens in new tab)

Covered by 3 sources including DEV Community, lesswrong.comDiscussed on Hacker News

We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the r...

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (opens in new tab)

Covered in 5 articles

Anthropic NLAs: Reading Claude's Hidden Thoughts in 2026

[Linkpost] How Transparent Is DiffusionGemma (and why it matters)

SAE It Across Models: Explaining Features With Foreign NLA Verbalizers