🎯 AI Alignment - faruk · Scour

Contra Dance at LessOnline

⚙️AI Infrastructure

Trajectory Geometry of Transformer Representations Across Layers

🧠LLMs Academic

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

🔍GEO Academic

One Year of PauseAI UK

📊AI Monitoring

lesswrong.com

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🧠LLMs Academic

Less-relevant results

Coming Around To Political Donations

🧑‍💻Indie Hackers

Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

🧠LLMs Academic

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

🧠LLMs Academic

Book of Cron Job

🧑‍💻Indie Hackers

lesswrong.com

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🧠LLMs Academic

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

🔍GEO Academic

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

🧠LLMs Academic

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

🧠LLMs Academic

Towards a Formal Scientific Epistemology

lesswrong.com

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

🔍GEO Academic

Accounting for Context: Shaping Moral Credences for Value Alignment

🧩Epistemics Academic

Alignment Defends LLMs from Property Inference Attacks

🧠LLMs Academic

[Paper] Dictionary Learning Identifiability for Understanding SAEs

lesswrong.com

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🧠LLMs Academic

My research agenda and work

lesswrong.com
