Interpretability

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

💬 LLMs · Content type: Academic
arxiv.org

Defeating Introspection Adapters (and Why Threat Models Matter)

💬 LLMs
lesswrong.com

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🔄 Transformers · Content type: Academic
arxiv.org

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

💬 LLMs · Content type: Academic
arxiv.org

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

🔄 Transformers · Content type: Academic
arxiv.org

Can activation verbalizers surface an internal chain of thought?

💬 LLMs
lesswrong.com

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

💬 LLMs · Content type: Academic
arxiv.org

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

🔄 Transformers · Content type: Academic
arxiv.org

Building Better Activation Oracles

⚙️ Model Training
lesswrong.com

Wearable Single-Lead ECG Detects Fine-Grained Structural Heart Disease Through Echo-Report Supervision

⚙️ Model Training · Content type: Academic
arxiv.org

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

🔄 Transformers · Content type: Academic
arxiv.org

Where does Absolute Position come from in decoder-only Transformers?

🔄 Transformers · Content type: Academic
arxiv.org

Adversarial Robustness of Activation Steering in Large Language Models

💬 LLMs · Content type: Academic
arxiv.org

Two More Methods for Consistency Training and Some New Ways to Apply It

AI Research
lesswrong.com

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

AI Agents · Content type: Academic
arxiv.org

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

📉 Deep Learning · Content type: Academic
arxiv.org

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

📐 Scaling Laws · Content type: Academic
arxiv.org

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

💬 LLMs · Content type: Academic
arxiv.org

LLM Self-Recognition: Steering and Retrieving Activation Signatures

AI Research · Content type: Academic
arxiv.org

Temporal Preference Concepts and their Functions in a Large Language Model

💬 LLMs · Content type: Academic
arxiv.org