Mech Interp

mechanistic interpretability, circuits, superposition, feature visualization, AI interpretability

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

 🔍Interpretability  Content type: Academic
arxiv.org
Less-relevant results

Coelho Mollo and Millière: The Vector Grounding Problem

 🛡️AI Safety

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

 🔍Interpretability  Content type: Academic
arxiv.org

The technical community can't be the main character in AI safety anymore

 🛡️AI Safety
substackcdn.com · Substack

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

 🔍Interpretability  Content type: Academic
arxiv.org

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

 🔍Interpretability  Content type: Academic
arxiv.org

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

 🔍Interpretability  Content type: Academic
arxiv.org

Who Elected Anthropic?

 🛡️AI Safety  Content type: Blog

Mechanistic Analysis of Alignment Algorithms in Language Models

 🔍Interpretability  Content type: Academic
arxiv.org

Trajectory Geometry of Transformer Representations Across Layers

 🔍Interpretability  Content type: Academic
arxiv.org

Vision-Language Asymmetry in Bistable Image Captioning

 🔍Interpretability  Content type: Academic
arxiv.org

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

 🔍Interpretability  Content type: Academic
arxiv.org

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

 🔍Interpretability  Content type: Academic
arxiv.org

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🔍Interpretability  Content type: Academic
arxiv.org

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

 🔍Interpretability  Content type: Academic
arxiv.org

Interpreting Brain Responses to Language with Sparse Features from Language Models

 🔍Interpretability  Content type: Academic
arxiv.org

Temporal Preference Concepts and their Functions in a Large Language Model

 🔍Interpretability  Content type: Academic
arxiv.org

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

 🛡️AI Safety  Content type: Academic
arxiv.org

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

 🔍Interpretability  Content type: Academic
arxiv.org

Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking

 Formal Verification  Content type: Academic
arxiv.org
