AI Alignment


Contra Dance at LessOnline

 ⚙️AI Infrastructure
jefftk.com·

Trajectory Geometry of Transformer Representations Across Layers

 🧠LLMs  Content type: Academic
arxiv.org·

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

 🔍GEO  Content type: Academic
arxiv.org·

One Year of PauseAI UK

 📊AI Monitoring
lesswrong.com·

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🧠LLMs  Content type: Academic
arxiv.org·

Less-relevant results

Coming Around To Political Donations

 🧑‍💻Indie Hackers
jefftk.com·

Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

 🧠LLMs  Content type: Academic
arxiv.org·

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

 🧠LLMs  Content type: Academic
arxiv.org·

Book of Cron Job

 🧑‍💻Indie Hackers
lesswrong.com·

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

 🧠LLMs  Content type: Academic
arxiv.org·

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

 🔍GEO  Content type: Academic
arxiv.org·

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🧠LLMs  Content type: Academic
arxiv.org·

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

 🧠LLMs  Content type: Academic
arxiv.org·

Towards a Formal Scientific Epistemology

 🧩Epistemics
lesswrong.com·

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

 🔍GEO  Content type: Academic
arxiv.org·

Accounting for Context: Shaping Moral Credences for Value Alignment

 🧩Epistemics  Content type: Academic
arxiv.org·

Alignment Defends LLMs from Property Inference Attacks

 🧠LLMs  Content type: Academic
arxiv.org·

[Paper] Dictionary Learning Identifiability for Understanding SAEs

 🧠LLMs
lesswrong.com·

Interactions Between Crosscoder Features: A Compact Proofs Perspective

 🧠LLMs  Content type: Academic
arxiv.org·

My research agenda and work

 🧠LLMs
lesswrong.com·