Interpretability

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

LLMs | Content type: Academic
arxiv.org

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

LLMs | Content type: Academic
arxiv.org

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

LLMs | Content type: Academic
arxiv.org

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

LLMs | Content type: Academic
arxiv.org

A Unifying Framework for Concept-Based Representational Similarity

Transformers | Content type: Academic
arxiv.org

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

AI Research | Content type: Academic
arxiv.org

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Scaling Laws | Content type: Academic
arxiv.org

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Model Training | Content type: Academic
arxiv.org

Vision-Language Asymmetry in Bistable Image Captioning

AI Research | Content type: Academic
arxiv.org

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

AI Research | Content type: Academic
arxiv.org

The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

LLMs | Content type: Academic
arxiv.org

Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment

Scaling Laws | Content type: Academic
arxiv.org

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

LLMs | Content type: Academic
arxiv.org

WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

Transformers | Content type: Academic
arxiv.org

TALAN: Task-Aligned Latent Adaptation Networks for Targeted Post-Training of Large Language Models

Model Training | Content type: Academic
arxiv.org

Consistency Training Along the Transformer Stack

LLMs | Content type: Academic
arxiv.org

Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs

Model Training | Content type: Academic
arxiv.org

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

LLMs | Content type: Academic
arxiv.org

Interpreting Brain Responses to Language with Sparse Features from Language Models

LLMs | Content type: Academic
arxiv.org

Dead Directions: Geometric Singular Learning

Reinforcement Learning | Content type: Academic
arxiv.org
