Debugging misaligned completions with sparse-autoencoder latent attribution
alignment.openai.com


Dec 1, 2025 · Tom Dupre la Tour and Dan Mossing, in collaboration with the Interpretability team

We use interpretability tools to study the mechanisms underlying misalignment in language models. In previous work (Wang et al., 2025), we used a model-diffing approach to study the mechanism of emergent misalignment (Betley et al., 2025) using sparse autoencoders (SAEs) (Cunningham et al., 2023; Bricken et al., 2023; Gao et al., 2024).[1]

Specifically, we used a two-step model-diffing approach to compa…
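To make the SAE machinery referenced above concrete, here is a minimal sketch of a TopK sparse autoencoder forward pass in the style of Gao et al. (2024): an activation vector is encoded into a wide latent space, only the k largest pre-activations are kept, and the input is reconstructed from those few active latents. The weights here are randomly initialized for illustration, not trained, and the sizes are toy values, not those of any model discussed in the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not the post's actual sizes)
d_model, d_latent, k = 16, 64, 4  # k = number of active latents kept

# Randomly initialized encoder/decoder weights; a real SAE trains these
# to minimize reconstruction error on model activations.
W_enc = rng.normal(0, 0.1, (d_model, d_latent))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0, 0.1, (d_latent, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse latents, then reconstruct it."""
    pre = x @ W_enc + b_enc
    # TopK sparsity: keep the k largest pre-activations, zero out the rest,
    # then apply ReLU so remaining latents are non-negative.
    latents = np.zeros_like(pre)
    top_idx = np.argsort(pre)[-k:]
    latents[top_idx] = np.maximum(pre[top_idx], 0.0)
    recon = latents @ W_dec + b_dec
    return latents, recon

x = rng.normal(size=d_model)          # stand-in for a residual-stream activation
latents, recon = sae_forward(x)
# At most k latents are nonzero; each active latent is a candidate
# interpretable feature for attribution analyses like those in the post.
```

The sparse `latents` vector is what makes attribution tractable: instead of reasoning over a dense activation, one can ask which of the handful of active latents drive a behavior of interest.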
