🔬 Mech Interp - taylor · Scour

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

🔍Interpretability Academic

Less-relevant results

Coelho Mollo and Millière: The Vector Grounding Problem

🛡️AI Safety

philosophyofbrains.com

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

🔍Interpretability Academic

The technical community can't be the main character in AI safety anymore

🛡️AI Safety

substackcdn.com · Substack

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

🔍Interpretability Academic

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

🔍Interpretability Academic

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🔍Interpretability Academic

Who Elected Anthropic?

🛡️AI Safety Blog

vizierprime.substack.com · Substack

Mechanistic Analysis of Alignment Algorithms in Language Models

🔍Interpretability Academic

Trajectory Geometry of Transformer Representations Across Layers

🔍Interpretability Academic

Vision-Language Asymmetry in Bistable Image Captioning

🔍Interpretability Academic

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

🔍Interpretability Academic

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

🔍Interpretability Academic

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🔍Interpretability Academic

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

🔍Interpretability Academic

Interpreting Brain Responses to Language with Sparse Features from Language Models

🔍Interpretability Academic

Temporal Preference Concepts and their Functions in a Large Language Model

🔍Interpretability Academic

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

🛡️AI Safety Academic

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

🔍Interpretability Academic

Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking

✅Formal Verification Academic
