Interpretability

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

💬 LLMs · Content type: Academic
arxiv.org

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

🖥️ ML Systems · Content type: Academic
arxiv.org

Trajectory Geometry of Transformer Representations Across Layers

🔄 Transformers · Content type: Academic
arxiv.org

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

🧠 AI Research · Content type: Academic
arxiv.org

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

💬 LLMs · Content type: Academic
arxiv.org

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

🔄 Transformers · Content type: Academic
arxiv.org

Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

🔄 Transformers · Content type: Academic
arxiv.org

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

📉 Deep Learning · Content type: Academic
arxiv.org

ICA Lens: Interpreting Language Models Without Training Another Dictionary

💬 LLMs · Content type: Academic
arxiv.org

Where does Absolute Position come from in decoder-only Transformers?

🔄 Transformers · Content type: Academic
arxiv.org

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

🔄 Transformers · Content type: Academic
arxiv.org

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

📉 Deep Learning · Content type: Academic
arxiv.org

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🧠 AI Research · Content type: Academic
arxiv.org

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

📉 Deep Learning · Content type: Academic
arxiv.org

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

💬 LLMs · Content type: Academic
arxiv.org

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

🎮 Reinforcement Learning · Content type: Academic
arxiv.org

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🔄 Transformers · Content type: Academic
arxiv.org

LLM Self-Recognition: Steering and Retrieving Activation Signatures

🧠 AI Research · Content type: Academic
arxiv.org

Temporal Preference Concepts and their Functions in a Large Language Model

💬 LLMs · Content type: Academic
arxiv.org

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

🔄 Transformers · Content type: Academic
arxiv.org
