🛡️ AI Safety - taylor · Scour

The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably

🌐Distributed Systems

lesswrong.com

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

✅Formal Verification Code

github.com · Hacker News

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

🎲Procedural Generation Academic

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

🔬Mech Interp Academic

A Regret Minimization Framework on Preference Learning in Large Language Models

🔍Information Retrieval Academic

Adversarial Robustness of Activation Steering in Large Language Models

🔍Interpretability Academic

Trajectory Geometry of Transformer Representations Across Layers

🔍Interpretability Academic

Learnings from starting an AI safety research team

✅Formal Verification

lesswrong.com

Some economics of artificial superintelligence

🔬Mech Interp Academic

Preparing for Warning Shots to Catalyze International Cooperation on AGI Risks

🎲Procedural Generation

lesswrong.com

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🔍Interpretability Academic

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

🔬Mech Interp Academic

Neglected Basics of AI Alignment

🌐Distributed Systems

lesswrong.com

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

🌐Distributed Systems Academic

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🔍Interpretability Academic

Iliad is Hiring

✅Formal Verification

lesswrong.com

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

🔍Interpretability Academic

Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

🌐Distributed Systems Academic

One Year of PauseAI UK

🌐Distributed Systems

lesswrong.com

FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors

🔍Interpretability Academic
