AI Interpretability

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🛡️AI Safety  Content type: Academic
arxiv.org·

Exploration of a DNA Sequencing Basecaller using Activation Patching

 🛡️AI Safety
lesswrong.com·

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

 🛡️AI Safety  Content type: Academic
arxiv.org·

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

 🛡️AI Safety  Content type: Academic
arxiv.org·

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

 🕸️Sparse Embeddings  Content type: Academic
arxiv.org·

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

 🪄Prompt Engineering  Content type: Academic
arxiv.org·

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

 🎯Qdrant  Content type: Academic
arxiv.org·

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

 🕸️Sparse Vectors  Content type: Academic
arxiv.org·

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

 📊Embeddings  Content type: Academic
arxiv.org·

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

 🤖AI  Content type: Academic
arxiv.org·

ICA Lens: Interpreting Language Models Without Training Another Dictionary

 🤖AI  Content type: Academic

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

 🛡️AI Safety  Content type: Academic
arxiv.org·

Trajectory Geometry of Transformer Representations Across Layers

 🦉Qwen  Content type: Academic
arxiv.org·

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

 🕸️Sparse Vectors  Content type: Academic
arxiv.org·

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🆕New AI  Content type: Academic
arxiv.org·

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

 🛡️AI Safety  Content type: Academic
arxiv.org·

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

 🕸️Sparse Vectors  Content type: Academic
arxiv.org·

Interactions Between Crosscoder Features: A Compact Proofs Perspective

 🛡️AI Safety  Content type: Academic
arxiv.org·

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

 🤖AI  Content type: Academic
arxiv.org·

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

 🧩MoE  Content type: Academic
arxiv.org·
