🔍 AI Interpretability - justjcullen

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

🤖Machine Learning Academic

arxiv.org

ICA Lens: Interpreting Language Models Without Training Another Dictionary

🕸Knowledge Graphs Academic

arxiv.org · Cited by 1 article

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

🤖Machine Learning Academic

arxiv.org

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

🎯Reinforcement Learning Academic

arxiv.org

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

🤖Machine Learning Academic

arxiv.org

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

🗃️Zettelkasten Academic

arxiv.org

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🤖Machine Learning Academic

arxiv.org

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

🗃️Zettelkasten Academic

arxiv.org

Trajectory Geometry of Transformer Representations Across Layers

⚡Incremental Computation Academic

arxiv.org

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

⚡Incremental Computation Academic

arxiv.org

One Lens, Many Worlds : A Capability-Typed Interface for World-Model Interpretability

🎯Reinforcement Learning Academic

arxiv.org

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

🕸Knowledge Graphs Academic

arxiv.org

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

⚡Incremental Computation Academic

arxiv.org

Mechanistic Analysis of Alignment Algorithms in Language Models

∘Category Theory Academic

arxiv.org

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

🎯Reinforcement Learning Academic

arxiv.org

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

🗃️Zettelkasten Academic

arxiv.org

Vision-Language Asymmetry in Bistable Image Captioning

🗃️Zettelkasten Academic

arxiv.org

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos
