Published on December 18, 2025 1:48 AM GMT

Alright, so I’ve been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly attack an AI’s own ontology, the whole thing kind of broke down.

Linear probes find directions that work, but they don’t tell you WHY they work. So I combined the two.

I train probes on SAE feature activations instead of raw activations. Now the probe weights tell you exactly which features matter and how much. 
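Concretely, “SAE feature activations” here just means pushing the raw residual-stream activations through the SAE’s encoder before fitting the probe. A minimal sketch of that step, assuming a standard ReLU encoder; the dimensions, the random placeholder weights, and the `sae_features` helper are all made up for illustration, and a real run would load a trained SAE and pull residual activations with forward hooks:

```python
import torch

# Hypothetical SAE encoder parameters: a real run would load these from a
# trained SAE. The dimensions (d_model=768, d_sae=4096) are made up.
d_model, d_sae = 768, 4096
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)

def sae_features(resid_acts: torch.Tensor) -> torch.Tensor:
    """Map raw residual-stream activations (n, d_model) to SAE feature
    activations (n, d_sae) with a standard ReLU encoder."""
    return torch.relu((resid_acts - b_dec) @ W_enc + b_enc)

# resid_acts would come from a forward pass with hooks; random placeholder here.
resid_acts = torch.randn(32, d_model)
X = sae_features(resid_acts)  # probe inputs: one row of SAE activations per example
```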

TL;DR

  • Get your contrastive examples
  • Get your SAE features
  • Train a sparse linear probe with ElasticNet regularization (most feature weights get driven to zero straight away)
  • The probe weights tell you which features matter and which ones don’t (see the sketch after this list)
  • Build a st…
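Here’s a minimal sketch of the probe-training step, assuming the SAE feature activations for your contrastive examples are already stacked into a matrix X (one row per example) with binary labels y. The random placeholders and the regularization settings (l1_ratio, C) are illustrative, not the post’s actual data or hyperparameters, and I’m using scikit-learn’s LogisticRegression with an elasticnet penalty as the sparse probe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder inputs: in practice X holds SAE feature activations for the
# contrastive examples, y is 1 for "concept present" and 0 for the negatives.
rng = np.random.default_rng(0)
X = rng.random((2000, 4096))
y = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# ElasticNet-regularized logistic probe: the L1 part drives most SAE-feature
# weights to exactly zero, leaving a small set of features carrying the signal.
probe = LogisticRegression(
    penalty="elasticnet",
    solver="saga",     # the sklearn solver that supports the elasticnet penalty
    l1_ratio=0.9,      # mostly L1 -> sparse weights (illustrative choice)
    C=0.1,             # stronger regularization -> fewer surviving features
    max_iter=2000,
)
probe.fit(X_train, y_train)
print(f"test accuracy: {probe.score(X_test, y_test):.3f}")

# The probe weights are the interpretation: nonzero entries are the SAE
# features the probe relies on, ranked here by magnitude.
weights = probe.coef_.ravel()
nonzero = np.flatnonzero(weights)
top = nonzero[np.argsort(-np.abs(weights[nonzero]))][:10]
for idx in top:
    print(f"SAE feature {idx}: weight {weights[idx]:+.4f}")
print(f"{len(nonzero)} / {len(weights)} features have nonzero weight")
```

Sweeping C (or l1_ratio) trades probe accuracy against how few features survive; the point of the mostly-L1 penalty is that the surviving nonzero weights are the readout of which SAE features the probe actually relies on, and how much.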
