Scour
🎯 Alignment Research
AI alignment, RLHF, value alignment, reward modeling
186584
posts in
24.1
ms
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
🛡️ AI Safety · arxiv.org · 2d
AI & Alignment
🛡️ AI Safety · chriscoyier.net · 5d · Hacker News
An Alignment Journal: Adaptation to AI
🛡️ AI Safety · lesswrong.com · 2d
RLHF Flow-GRPO implementation POC by ifilipis · Pull Request #808
🪝 eBPF · github.com · 2d · r/StableDiffusion
reward-lens: A Mechanistic Interpretability Library for Reward Models
🔍 AI Interpretability · arxiv.org · 21h
Deep Learning Weekly: Issue 453
⚡ Edge AI · deeplearningweekly.com · 10h
Reinforcement fine-tuning with LLM-as-a-judge
🪄 Prompt Engineering · aws.amazon.com · 5h
🥇 Top AI Papers of the Week
🇨🇳 Chinese AI · nlp.elvissaravia.com · 4d
Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
🔍 AI Interpretability · arxiv.org · 2d
Collective intelligence framework shows how human-AI teams may make better decisions
🤝 Human-AI Collaboration · techxplore.com · 5h
AI Infrastructure Architect · Builder · Author
🇨🇳 Chinese AI · markferraz.com · 6h · Hacker News
Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
🧠 Agent Memory · arxiv.org · 2d
The Inference Economy: Token Use
💭 Reasoning Models · frontierai.substack.com · 7h · Substack
Alignment Makes Models More Decisive Without Making Them More Truthful
🛡️ AI Safety · zenodo.org · 3d · r/singularity
The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
🎭 Claude · contralabs.com · 6h · Hacker News
Three Models of RLHF Annotation: Extension, Evidence, and Authority
⚙️ MLOps · arxiv.org · 1d
Alibaba's Metis agent cuts redundant AI tool calls from 98% to 2% — and gets more accurate doing it
🕵️ AI Agents · venturebeat.com · 4h
New Content From Current Directions in Psychological Science
🌋 Existential Risk Research · psychologicalscience.org · 10h
What Sentences Cause Alignment Faking?
🛡️ AI Safety · lesswrong.com · 2d
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
🔢 BitNet · arxiv.org · 2d