Published on December 8, 2025 8:46 PM GMT
There are three broad types of approach I see for making interpretability rigorous. I’ve put them in ascending order of how much assurance I think they can provide. I think they all have pros and cons, and am generally in favor of rigor.
- (weakest) Practical utility: Does this interpretability technique help solve problems we care about, such as jailbreaks?
- This is the easiest to test for, but also doesn’t directly bear on the question of “does the system work the way we think it does” that mechanistic interpretability is supposedly all about.
- I think of Cas as having pushed this approach in the AI safety community. From talking to Leo Gao, it seems like GDM has recently pivoted in this direction.
- (stronger) Simulatability: Can a human equipped with this interpretability technique make better predictions about a given system’s behavior?
- This seems very underexplored from what I can tell.
- Running experiments like this is costly because they require human subjects.
- Ultimately, this isn’t satisfying because a tool could improve a human user’s predictions through manipulation rather than by informing them.
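To make the simulatability idea concrete, here is a minimal sketch of what such an evaluation measures (the setup and names are hypothetical, not an existing protocol, and the “human” predictors are stand-in functions):

```python
# Hypothetical sketch of a simulatability evaluation: does access to an
# interpretability tool's output improve a predictor's accuracy at guessing
# the model's behavior? In a real experiment the predictors would be humans.

def simulatability_gap(trials, predict_with_tool, predict_without_tool):
    """Return accuracy(with tool) - accuracy(without tool) over trials.

    Each trial is (input, true_model_output); predictors map input -> guess.
    """
    with_hits = sum(predict_with_tool(x) == y for x, y in trials)
    without_hits = sum(predict_without_tool(x) == y for x, y in trials)
    n = len(trials)
    return with_hits / n - without_hits / n

# Toy example: the "model" outputs 1 iff the input is even. The tool-assisted
# predictor has learned this rule; the unassisted predictor always guesses 1.
trials = [(x, int(x % 2 == 0)) for x in range(10)]
gap = simulatability_gap(
    trials,
    predict_with_tool=lambda x: int(x % 2 == 0),
    predict_without_tool=lambda x: 1,
)
print(gap)  # 0.5: tool access lifts accuracy from 0.5 to 1.0 in this toy setup
```

The gap statistic also illustrates the manipulation worry below: it only measures whether predictions improved, not whether they improved because the tool was truthful.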
- (strongest) Principled theoretical understanding: have we developed rigorous mathematical definitions with satisfying conceptual properties and shown that systems meet them?
- Causal scrubbing and Atticus Geiger’s work are examples of such ideas; neither is satisfactory.
- Strict definitions of interpretability are probably completely intractable to satisfy, but we could hope to characterize conditions under which approximations are “good enough” according to various criteria.
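To gesture at what a behavior-preservation check in the spirit of causal scrubbing looks like, here is a toy sketch (this is an illustration of the flavor of the idea, not the actual algorithm): a hypothesis claims some component is irrelevant to the output, so we resample that component’s activation from other inputs and check that the output is unchanged.

```python
import random

# Toy sketch in the spirit of causal scrubbing (not the real algorithm).
# Hypothesis: component `b` does not affect the output. Test it by swapping
# in `b` activations computed on other inputs and measuring agreement.

def component_a(x):
    return x % 2          # parity: the feature the output actually uses

def component_b(x):
    return x // 2         # magnitude: hypothesized to be irrelevant

def model(a_act, b_act):
    return a_act          # output depends only on component a

def scrubbed_agreement(inputs, rng):
    """Fraction of inputs whose output survives resampling component b."""
    hits = 0
    for x in inputs:
        clean = model(component_a(x), component_b(x))
        resampled_b = component_b(rng.choice(inputs))  # another input's b
        scrubbed = model(component_a(x), resampled_b)
        hits += clean == scrubbed
    return hits / len(inputs)

rng = random.Random(0)
print(scrubbed_agreement(list(range(100)), rng))  # 1.0: the hypothesis holds
```

Here the hypothesis is exactly true, so agreement is perfect; an agreement below 1.0 would quantify how much behavior the hypothesis fails to explain, which is the kind of “good enough” criterion mentioned above.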
Some random extra context:
I have a bit of a reputation as a skeptic/hater of mechanistic interpretability in the safety community. This is not entirely unearned; it’s largely born of an impression that much of the early work lacked rigor and was basically a bunch of “just-so stories”. Colleagues began telling me that this was clearly no longer the case starting with the circuits thread, and I’ve definitely noticed an improvement.