The Future of Interpretability is Geometric
lesswrong.com·11h

Published on October 24, 2025 6:32 PM GMT

TLDR: Anthropic’s recent paper “When Models Manipulate Manifolds” shows that the geometry of LLM activation space carries meaningful, interpretable structure. This matches my intuition about where interpretability research is heading, and I think AI safety researchers should build tools to uncover more geometric insights. The paper is a big deal for interpretability because it demonstrates that such insights exist and, most importantly, that we need unsupervised tools for detecting complex geometric structures that SAEs alone cannot reveal.


Background

Interpretability researchers have lon…
