High-level approaches to rigor in interpretability
lesswrong.com·1h

Published on December 8, 2025 8:46 PM GMT

I see three broad types of approach to making interpretability rigorous. I’ve put them in ascending order of how much assurance I think they can provide. I think they all have pros and cons, and I am generally in favor of rigor.

  1. (weakest) Practical utility: Does this interpretability technique help solve problems we care about, such as jailbreaks?
    1. This is the easiest to test for, but it doesn’t directly bear on the question that mechanistic interpretability is supposedly all about: “does the system work the way we think it does?”
    2. I think of Cas as having pushed this approach in the AI safety community.  From talking t…
