Interpretability

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

馃Cognitive Neurosciens for AIContent type: Academic
arxiv.org

The technical community can't be the main character in AI safety anymore

🎯 Alignment
substackcdn.com (Substack)

scMTG reconstructs single-cell temporal dynamics with Markov transition generators

💾 Memory Systems
Content type: Academic
biorxiv.org

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

馃Cognitive Neurosciens for AIContent type: Academic
arxiv.org
Less-relevant results

AI-augmented coaching platform specifically for dissertation/thesis students

🎯 Alignment
Content type: Discussion

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

🔍 RAG
Content type: Academic
arxiv.org

SAE It Across Models: Explaining Features With Foreign NLA Verbalizers

🎨 Multimodal AI
lesswrong.com

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

馃NeuroscienceContent type: Academic
arxiv.org

Who Elected Anthropic?

🎯 Alignment
Content type: Blog

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

馃Cognitive Neurosciens for AIContent type: Academic
arxiv.org

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🎨 Multimodal AI
Content type: Academic
arxiv.org

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

💾 Memory Systems
Content type: Academic
arxiv.org

Mechanistic Analysis of Alignment Algorithms in Language Models

🎯 Alignment
Content type: Academic
arxiv.org

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

馃Embodied AIContent type: Academic
arxiv.org

A Deployment-Oriented Framework for Explainable AI-Assisted eBPF/XDP Mitigation at the IoT Edge

🎯 Alignment
Content type: Academic
arxiv.org

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

馃Cognitive Neurosciens for AIContent type: Academic
arxiv.org

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

馃Cognitive Neurosciens for AIContent type: Academic
arxiv.org

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

🌀 Hallucination
Content type: Academic
arxiv.org

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

🎨 Multimodal AI
Content type: Academic
arxiv.org

Inside the LLM Word Factory

💾 Memory Systems
Content type: Academic
arxiv.org
