🛡️ AI Safety - ibrahimsharaf · Scour

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders 🎯RLHF

On Humanity & Human Beings 🔓Open Source AI

mhdempsey.substack.com·5d·Substack

Document-tuning instills durable animal compassion in LLMs (and generalizes to humans) 🏢LLM Adoption

lesswrong.com·1h

Disney's near-perfect 10-part crime thriller saga is so good, you'll finish it in one weekend 📊LLM Evaluation

polygon.com·3d

Less-relevant results

AI essays 🤖AI Agents

rhollick.wordpress.com·13h

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction 🕸️Knowledge Graphs

Compositional Adversarial Training for Robust Visual Watermarking 🧠LLMs

AI emotions and aligned behavior 🎯RLHF

lesswrong.com·2d

By 20 to 1, Americans Want the White House to Safety Test AI 🔓Open Source AI

ifstudies.org·2d·r/OpenAI

Four AI supply-chain attacks in 50 days exposed the release pipeline red teams aren't covering 🤖AI Agents

venturebeat.com·2d

rl for red teaming: training models to attack and defend themselves 🎯RLHF

castform.com·6d·Hacker News

Stitched Value Model for Diffusion Alignment ⚡Quantization

Automated Alignment is Harder Than You Think ⚙️Transformers

lesswrong.com·6d

Confidentiality is not security: Why the real AI runtime crisis Is the Authorization Gap 🤖AI Agents

·6d

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization 🎯RLHF

The 'Mythos Moment' 🤖AI Agents

profserious.substack.com·3d·Substack

Let's have more partial insiders. 🏢LLM Adoption

lesswrong.com·1d

The Anthropic Case: Do We Need an Ethical Framework for Interacting with AI? 🔓Open Source AI

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation 🎯RLHF
