🔍 AI Interpretability - jimman

Interpretable enzyme function prediction via sparse autoencoder features of ESMC across the microbial protein universe

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

⚡Model Efficiency Academic

arxiv.org·

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

⚡LLM Optimization Academic

arxiv.org·

Decoding Multimodal Cues: Unveiling the Implicit Meaning Behind Hateful Videos

✍️Prompt Engineering Academic

arxiv.org·

ICA Lens: Interpreting Language Models Without Training Another Dictionary

⚡LLM Optimization Academic

arxiv.org·

Trajectory Geometry of Transformer Representations Across Layers

⚡LLM Optimization Academic

arxiv.org·

One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability

⚡LLM Optimization Academic

arxiv.org·

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

⚡LLM Optimization Academic

arxiv.org·

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

⚡Model Efficiency Academic

arxiv.org·

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

⚡LLM Optimization Academic

arxiv.org·

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

⚡LLM Optimization Academic

arxiv.org·

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

⚡LLM Optimization Academic

arxiv.org·

Mechanistic Analysis of Alignment Algorithms in Language Models

⚡LLM Optimization Academic

arxiv.org·

Steering Multirobot Behavior via Closed-Loop Affine Activation Editing

✍️Prompt Engineering Academic

arxiv.org·

Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

✍️Prompt Engineering Academic

arxiv.org·

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

✍️Prompt Engineering Academic

arxiv.org·

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

🤖AI Academic

arxiv.org·

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

scMTG reconstructs single-cell temporal dynamics with Markov transition generators

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
