ICA Lens: Interpreting Language Models Without Training Another Dictionary

arxiv.org · Covered by ai-brief.liziran.com

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already vi...

