Published on December 8, 2025 8:46 PM GMT

There are three broad types of approach I see for making interpretability rigorous.  I’ve put them in ascending order of how much assurance I think they can provide.  I think they all have pros and cons, and am generally in favor of rigor.

  1. (weakest) Practical utility: Does this interpretability technique help solve problems we care about, such as jailbreaks?
    1. This is the easiest to test for, but also doesn’t directly bear on the question of “does the system work the way we think it does” that mechanistic interpretability is supposedly all about.
    2. I think of Cas as having pushed this approach in the AI safety community.  From talking t…

Similar Posts

Loading similar posts...

Keyboard Shortcuts

Navigation
Next / previous item
j/k
Open post
oorEnter
Preview post
v
Post Actions
Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Recommendations
Add interest / feed
Enter
Not interested
x
Go to
Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Browse
gb
Search
/
General
Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help