Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
arxiv.org·3d

Abstract: Sparse autoencoders (SAEs) promise a unified approach to mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. Realizing this potential, however, requires that the learned features be both interpretable and steerable. To that end, we introduce two new, computationally inexpensive metrics, one for interpretability and one for steerability, and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit low interpretability, low steerability, or both, rendering them ineffective for downstream use; and (ii) due to…
