Scour
🎯 AI Alignment
Value Learning, RLHF, Constitutional AI, Safety Research
Scoured 183,138 posts in 45.6 ms
AI & Alignment
👨‍💻 AI Coding · chriscoyier.net · 1d · Hacker News

Takes on Automating Alignment
👨‍💻 AI Coding · lesswrong.com · 6d

Hot Research Topics in AI and ML in 2026 and Their Philosophical Connections
🤨 AI Criticism · omseeth.github.io · 1d · Hacker News

Smarter Doesn’t Mean Safer: A Real RLHF Experiment on LLM Behavior
🎯 RLHF · medium.com · 6d

Could AI language models be used to help align themselves?
🤖 GenAI · 3quarksdaily.com · 4d

Diary of a "Doomer": 12+ years arguing about AI risk (part 3: the LLM era)
🤨 AI Criticism · lesswrong.com · 21h

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
🛡️ AI Safety Evals · jack-clark.net · 6d

Alignment by Default?
🛡️ AI Safety · blog.cosmos-institute.org · 6d · Hacker News

Mechanistic interpretability and lean
🎯 AI Reliability · alok.github.io · 5d

Third Symposium on AIT & ML: AI Safety Applications
🛡️ AI Safety Evals · lesswrong.com · 1d

Machine learning-driven alignment architecture of heterogeneous data with transient varying semantics
⚙️ ML Infrastructure · nature.com · 4d

AI models can learn harmful traits that evade safety filters
🛡️ AI Safety · earth.com · 6d

Alignment Faking Replication and Chain-of-Thought Monitoring Extensions
🐛 Fuzzing · lesswrong.com · 3h

GCs Are Way Beyond ‘Strategic.’ In AI Era, They Build Alignment
⚖️ AI Governance · news.bloomberglaw.com · 6d

Less human AI agents, please
🎼 Agent Orchestration · gridthegrey.com · 5d · DEV

How could I best use this opportunity? (AI Safety)
🛡️ AI Safety Evals · lesswrong.com · 8h

The Changing North Star of AI Control
🏛 Sovereign AI Infrastructure · lesswrong.com · 4d

Monday AI Radar #22
✍️ Prompt Engineering · lesswrong.com · 5d

Pando: A Controlled Benchmark for Interpretability Methods
🎯 AI Reliability · lesswrong.com · 5d
