🔍 Interpretability - Bingran · Scour

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

💬LLMs Academic

Defeating Introspection Adapters (and Why Threat Models Matter)

lesswrong.com

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🔄Transformers Academic

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

💬LLMs Academic

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

🔄Transformers Academic

Can activation verbalizers surface an internal chain of thought?

lesswrong.com

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

💬LLMs Academic

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

🔄Transformers Academic

Building Better Activation Oracles

⚙️Model Training

lesswrong.com

Wearable Single-Lead ECG Detects Fine-Grained Structural Heart Disease Through Echo-Report Supervision

⚙️Model Training Academic

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

🔄Transformers Academic

Where does Absolute Position come from in decoder-only Transformers?

🔄Transformers Academic

Adversarial Robustness of Activation Steering in Large Language Models

💬LLMs Academic

Two More Methods for Consistency Training and Some New Ways to Apply It

🧠AI Research

lesswrong.com

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

🤖AI Agents Academic

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

📉Deep Learning Academic

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

📐Scaling Laws Academic

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

💬LLMs Academic

LLM Self-Recognition: Steering and Retrieving Activation Signatures

🧠AI Research Academic

Temporal Preference Concepts and their Functions in a Large Language Model

💬LLMs Academic
