Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit
arxiv.org·1d


Abstract: Analyzing large-scale text corpora is a core challenge in machine learning, crucial for tasks like identifying undesirable model behaviors or biases in training data. Current methods often rely on costly LLM-based techniques (e.g., annotating dataset differences) or dense embedding models (e.g., for clustering), which lack control over the properties of interest. We propose using sparse autoencoders (SAEs) to create SAE embeddings: representations whose dimensions map to interpretable concepts. Through four data analysis tasks, we show that SAE embeddings are more cost-effective …
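To make the idea of an embedding "whose dimensions map to interpretable concepts" concrete, here is a minimal sketch of a generic sparse autoencoder forward pass in plain Python. It is not the paper's implementation: the ReLU plus top-k sparsity rule, the function names (`sae_encode`, `sae_decode`, `top_features`), the sizes, the random untrained weights, and the feature labels are all assumptions made for illustration. In practice the weights would be trained and the labels would come from inspecting what each feature responds to.

```python
import random

def sae_encode(x, W_enc, b_enc, k):
    """Map a dense vector x to a sparse, non-negative code with at most k active dims."""
    # One pre-activation per dictionary feature (each row of W_enc is one feature).
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    # ReLU: only features that fire positively stay active.
    act = [max(0.0, p) for p in pre]
    # Top-k sparsity (an assumed choice): keep the k largest activations, zero the rest.
    keep = set(sorted(range(len(act)), key=lambda i: act[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(act)]

def sae_decode(z, W_dec, b_dec):
    """Reconstruct a dense vector from the sparse code (W_dec has one row per feature)."""
    out = list(b_dec)
    for z_j, row in zip(z, W_dec):
        if z_j != 0.0:
            for d, w in enumerate(row):
                out[d] += z_j * w
    return out

def top_features(z, labels, n=3):
    """Return the n strongest active features as (label, activation) pairs."""
    active = [(labels[i], a) for i, a in enumerate(z) if a > 0.0]
    return sorted(active, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical sizes: 8-dim dense input, 32 dictionary features, at most 4 active.
rng = random.Random(0)
d_in, n_feat, k = 8, 32, 4
W_enc = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_feat)]
b_enc = [0.0] * n_feat
W_dec = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_feat)]
b_dec = [0.0] * d_in
labels = [f"feature_{i}" for i in range(n_feat)]  # placeholder concept names

x = [rng.gauss(0, 1) for _ in range(d_in)]   # stand-in for a dense text representation
z = sae_encode(x, W_enc, b_enc, k)           # the sparse "SAE embedding"
x_hat = sae_decode(z, W_dec, b_dec)          # reconstruction (meaningless while untrained)
print(top_features(z, labels))
```

Because each nonzero entry of `z` belongs to a single named feature, a downstream analysis can filter or compare documents on chosen features, which is the kind of control a dense embedding does not offer directly.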
