🔍 Interpretability - Bingran · Scour

mingusb/transformer-golf: The Fully Unrolled Transformer: An experimental repository for architecture simplification and compilation. [2026]

📉Deep Learning Code

github.com · Hacker News

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

🧠AI Research Academic

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

💬LLMs Academic

The Residual Stream Has a Geometry of Time

📈Quantitative Finance

lesswrong.com

Less-relevant results

The technical community can't be the main character in AI safety anymore

substackcdn.com · Substack

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

🎮Reinforcement Learning Academic

Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

🔄Transformers Academic

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

💬LLMs Academic

SAE It Across Models: Explaining Features With Foreign NLA Verbalizers

lesswrong.com

ICA Lens: Interpreting Language Models Without Training Another Dictionary

💬LLMs Academic

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🧠AI Research Academic

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

📉Deep Learning Academic

Harmfulness Directions in OLMo

⚙️Model Training

lesswrong.com

Mechanistic Analysis of Alignment Algorithms in Language Models

🎮Reinforcement Learning Academic

[Paper] Dictionary Learning Identifiability for Understanding SAEs

📐Scaling Laws

lesswrong.com

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

📉Deep Learning Academic

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

🔄Transformers Academic

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🧠AI Research Academic

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

🎮Reinforcement Learning Academic

Analysis of Metastable States in the Transformer Activation Space

🔄Transformers

lesswrong.com
