Making Linear Probes Interpretable

Published on December 18, 2025 1:48 AM GMT

Alright, so I’ve been messing around with LLMs for a few weeks now. SAE features are supposed to be interpretable, but when I wanted to directly attack an AI’s own ontology, the whole thing kind of broke down.

Linear probes find directions that work, but I didn’t know WHY they work. So I combined the two approaches.

I train probes on SAE feature activations instead of raw activations. Now the probe weights tell you exactly which features matter and how much. 
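Concretely, getting from raw activations to SAE feature activations is just one encoder pass. A minimal numpy sketch, assuming the standard ReLU SAE encoder form (the shapes, W_enc, and b_enc below are random stand-ins, not a trained SAE):

```python
import numpy as np

# Random stand-ins: n contrastive examples, d-dim model activations,
# m SAE features (overcomplete, m >> d). In the real pipeline, acts come
# from the model and W_enc / b_enc come from a trained SAE.
n, d, m = 512, 768, 4096
rng = np.random.default_rng(0)
acts = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
b_enc = np.zeros(m)

# Standard ReLU SAE encoder. With a trained SAE these feature activations
# are sparse; with random stand-ins they are merely non-negative.
features = np.maximum(acts @ W_enc + b_enc, 0.0)   # shape (n, m)
```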

TL;DR

  • Get your contrastive examples
  • Get your SAE features
  • Train a sparse linear probe with ElasticNet (most feature weights go to zero straight away; see the sketch after this list)
  • The probe weights tell you which features matter and which ones don’t
  • Build a st…
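Here is a rough sketch of the probe step in scikit-learn. One caveat: since the contrastive labels are binary, this uses logistic regression with an elastic-net penalty rather than the ElasticNet regressor; the hyperparameters are placeholders, and the data, including the two planted signal features, are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: n contrastive examples, m SAE features.
# In the real pipeline these would be SAE feature activations and
# labels from the contrastive pairs (concept present vs. absent).
n, m = 512, 4096
rng = np.random.default_rng(0)
features = np.maximum(rng.normal(size=(n, m)), 0.0)
labels = (features[:, 7] > features[:, 42]).astype(int)  # planted signal in two features

# Sparse linear probe: the elastic-net penalty drives most feature weights to zero.
probe = LogisticRegression(
    penalty="elasticnet",
    solver="saga",      # the sklearn solver that supports elastic net
    l1_ratio=0.9,       # mostly L1, so the probe stays sparse
    C=0.1,              # stronger regularization -> fewer surviving features
    max_iter=5000,
)
probe.fit(features, labels)

# Read off the probe: the nonzero weights say which SAE features it uses,
# and the sign/magnitude say how much each one matters.
weights = probe.coef_[0]
nonzero = np.flatnonzero(weights)
top = nonzero[np.argsort(-np.abs(weights[nonzero]))]
for i in top[:10]:
    print(f"SAE feature {i}: weight {weights[i]:+.3f}")
```

The l1_ratio / C trade-off controls how aggressively weights get zeroed out: push it too hard and the probe drops real features, too soft and you are back to an uninterpretable dense weight vector.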
