AI Safety

The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

 🌐Distributed Systems
lesswrong.com·

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

 Formal Verification  Content type: Code
github.com · Hacker News

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

 🎲Procedural Generation  Content type: Academic
arxiv.org·

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🔬Mech Interp  Content type: Academic
arxiv.org·

A Regret Minimization Framework on Preference Learning in Large Language Models

 🔍Information Retrieval  Content type: Academic
arxiv.org·

Adversarial Robustness of Activation Steering in Large Language Models

 🔍Interpretability  Content type: Academic
arxiv.org·

Trajectory Geometry of Transformer Representations Across Layers

 🔍Interpretability  Content type: Academic
arxiv.org·

Learnings from starting an AI safety research team

 Formal Verification
lesswrong.com·

Some economics of artificial superintelligence

 🔬Mech Interp  Content type: Academic
arxiv.org·

Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks

 🎲Procedural Generation
lesswrong.com·

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🔍Interpretability  Content type: Academic
arxiv.org·

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

 🔬Mech Interp  Content type: Academic
arxiv.org·

Neglected Basics of AI Alignment

 🌐Distributed Systems
lesswrong.com·

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

 🌐Distributed Systems  Content type: Academic
arxiv.org·

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

 🔍Interpretability  Content type: Academic
arxiv.org·

Iliad is Hiring

 Formal Verification
lesswrong.com·

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

 🔍Interpretability  Content type: Academic
arxiv.org·

Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

 🌐Distributed Systems  Content type: Academic
arxiv.org·

One Year of PauseAI UK

 🌐Distributed Systems
lesswrong.com·

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

 🔍Interpretability  Content type: Academic
arxiv.org·