AI Safety
AI alignment, AI safety, interpretability, AGI risk, RLHF
Scoured 297 posts in 6.8 ms
The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably
🌐
Distributed Systems
lesswrong.com
·
10 hours ago
teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8
✅
Formal Verification
Content type:
Code
github.com
·
3 days ago
·
Hacker News
Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
🎲
Procedural Generation
Content type:
Academic
arxiv.org
·
1 day ago
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
🔬
Mech Interp
Content type:
Academic
arxiv.org
·
15 hours ago
A Regret Minimization Framework on Preference Learning in Large Language Models
🔍
Information Retrieval
Content type:
Academic
arxiv.org
·
1 day ago
Adversarial Robustness of Activation Steering in Large Language Models
🔍
Interpretability
Content type:
Academic
arxiv.org
·
1 day ago
Trajectory Geometry of Transformer Representations Across Layers
🔍
Interpretability
Content type:
Academic
arxiv.org
·
1 day ago
Learnings from starting an AI safety research team
✅
Formal Verification
lesswrong.com
·
5 days ago
Some economics of artificial superintelligence
🔬
Mech Interp
Content type:
Academic
arxiv.org
·
1 day ago
Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks
🎲
Procedural Generation
lesswrong.com
·
5 days ago
Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
🔍
Interpretability
Content type:
Academic
arxiv.org
·
1 day ago
FoldSAE: Learning to Steer Protein Folding Through Sparse Representations
🔬
Mech Interp
Content type:
Academic
arxiv.org
·
1 day ago
Neglected Basics of AI Alignment
🌐
Distributed Systems
lesswrong.com
·
3 days ago
EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
🌐
Distributed Systems
Content type:
Academic
arxiv.org
·
6 days ago
Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers
🔍
Interpretability
Content type:
Academic
arxiv.org
·
1 day ago
Iliad is Hiring
✅
Formal Verification
lesswrong.com
·
3 days ago
Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation
🔍
Interpretability
Content type:
Academic
arxiv.org
·
1 day ago
Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models
🌐
Distributed Systems
Content type:
Academic
arxiv.org
·
1 day ago
One Year of PauseAI UK
🌐
Distributed Systems
lesswrong.com
·
5 days ago
FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors
🔍
Interpretability
Content type:
Academic
arxiv.org
·
6 days ago