TL;DR: Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters do not account for this variety. In this article, I propose AIR (Auto-Interpretability Router), a new protocol that uses a sentence embedder to identify a feature's category and then routes the activation examples best suited to that category to the auto-interpreter. Results...
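To make the routing idea concrete, here is a minimal sketch of the embed, classify, route flow described above. It is not the article's implementation. Everything specific in it is an assumption made for illustration: the two category names (`narrow`, `broad`), the classification rule (mean pairwise similarity of a feature's activation examples against a fixed threshold), the two selection strategies, and the toy `embed` function, which hashes character trigrams and stands in for a real sentence embedder.

```python
import math
import zlib
from itertools import combinations

DIM = 512  # size of the toy embedding space (assumption)


def embed(text):
    """Toy stand-in for a sentence embedder: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def classify(examples, threshold=0.4):
    """Hypothetical rule: tightly clustered examples -> 'narrow', otherwise 'broad'."""
    vectors = [embed(text) for text, _ in examples]
    # Vectors are unit length, so the dot product is the cosine similarity.
    sims = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(vectors, 2)]
    mean_sim = sum(sims) / len(sims) if sims else 1.0
    return "narrow" if mean_sim >= threshold else "broad"


def top_k(examples, k):
    """Keep the k strongest activations."""
    return sorted(examples, key=lambda e: e[1], reverse=True)[:k]


def spread_k(examples, k):
    """Keep k examples spaced evenly across the activation ranking."""
    ranked = sorted(examples, key=lambda e: e[1], reverse=True)
    if k >= len(ranked) or k < 2:
        return ranked[:k]
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Hypothetical routing table: category -> example-selection strategy.
ROUTES = {"narrow": top_k, "broad": spread_k}


def route(examples, k=2):
    """Return (category, examples to hand to the auto-interpreter)."""
    category = classify(examples)
    return category, ROUTES[category](examples, k)
```

Each example is a `(text, activation)` pair, and `route(examples, k)` returns the inferred category together with the examples that would be passed to the auto-interpreter. Swapping in a real sentence embedder only requires replacing `embed`; the routing table is where category-specific selection logic would live.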
