Steering Directions Are Explanations, Not Handles

A direction can be interpretable, causal, and predictive (and still a bad handle for intervention)

TLDR: In modern interpretability, we find “directions” inside a language model that seem to encode meaningful concepts. One such example is “This is about food.” A common next step is to steer the model along that direction, in the hope that it produces more of that concept. I show that even when a direction passes the tests we have for whether it “really” means something, the range over which the idea ...
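To make "steering" concrete, here is a minimal sketch of what the operation usually amounts to mechanically: adding a scaled copy of a direction vector to a model's hidden activations. Everything in it is an assumption for illustration, not taken from the article: the activations and the "food" direction are random toy vectors, and the names `steer`, `projection`, and `alpha` are hypothetical. It only shows the intervention itself and says nothing about how a real model's outputs respond to it.

```python
import numpy as np


def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to every hidden vector."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit


def projection(hidden, direction):
    """Scalar projection of each hidden vector onto the unit direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden @ unit


rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 8))   # toy activations: 4 tokens, 8 dimensions
direction = rng.normal(size=8)     # hypothetical "food" direction (random here)

steered = steer(hidden, direction, alpha=3.0)

# Along the direction, every token's projection moves by exactly alpha.
shift = projection(steered, direction) - projection(hidden, direction)
print(shift)
```

In this toy setting the shift along the direction is linear in `alpha` by construction. Whether the concept expressed in a real model's output tracks that shift over a useful range of `alpha` is a separate question.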
