Scour
AI Alignment
alignment research, AI safety, RLHF, value alignment
Scoured 72 posts in 5.6 ms
Contra Dance at LessOnline
⚙️ AI Infrastructure · jefftk.com · 4 days ago
Trajectory Geometry of Transformer Representations Across Layers
🧠 LLMs · Academic · arxiv.org · 2 days ago
Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics
🔍 GEO · Academic · arxiv.org · 10 hours ago
One Year of PauseAI UK
📊 AI Monitoring · lesswrong.com · 5 days ago
Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
🧠 LLMs · Academic · arxiv.org · 2 days ago
Less-relevant results
Coming Around To Political Donations
🧑‍💻 Indie Hackers · jefftk.com · 5 days ago
Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models
🧠 LLMs · Academic · arxiv.org · 2 days ago
Substrate Asymmetry in User-Side Memory: A Diagnostic Framework
🧠 LLMs · Academic · arxiv.org · 10 hours ago
Book of Cron Job
🧑‍💻 Indie Hackers · lesswrong.com · 6 days ago
Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers
🧠 LLMs · Academic · arxiv.org · 2 days ago
FoldSAE: Learning to Steer Protein Folding Through Sparse Representations
🔍 GEO · Academic · arxiv.org · 2 days ago
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
🧠 LLMs · Academic · arxiv.org · 1 day ago
When Attribution Patching Lies: Diagnosis and a Second-Order Correction
🧠 LLMs · Academic · arxiv.org · 1 day ago
Towards a Formal Scientific Epistemology
🧩 Epistemics · lesswrong.com · 1 day ago
Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
🔍 GEO · Academic · arxiv.org · 6 days ago
Accounting for Context: Shaping Moral Credences for Value Alignment
🧩 Epistemics · Academic · arxiv.org · 3 days ago
Alignment Defends LLMs from Property Inference Attacks
🧠 LLMs · Academic · arxiv.org · 1 day ago
[Paper] Dictionary Learning Identifiability for Understanding SAEs
🧠 LLMs · lesswrong.com · 6 days ago
Interactions Between Crosscoder Features: A Compact Proofs Perspective
🧠 LLMs · Academic · arxiv.org · 1 day ago
My research agenda and work
🧠 LLMs · lesswrong.com · 6 days ago