Scour
🎯 Alignment Research
AI alignment, RLHF, value alignment, reward modeling
Scoured 186,693 posts in 22.7 ms
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
🛡️ AI Safety · arxiv.org · 3d

An Alignment Journal: Adaptation to AI
🛡️ AI Safety · lesswrong.com · 2d
RLHF Flow-GRPO implementation POC by ifilipis · Pull Request #808
🪝 eBPF · github.com · 2d · r/StableDiffusion
Deep Learning Weekly: Issue 453
⚡ Edge AI · deeplearningweekly.com · 14h

Reinforcement fine-tuning with LLM-as-a-judge
🪄 Prompt Engineering · aws.amazon.com · 9h

Rewarding the Scientific Process: Process-Level Reward Modeling for Agentic Data Analysis
🔍 AI Interpretability · arxiv.org · 3d

AI Value Capture
🇨🇳 Chinese AI · newsletter.semianalysis.com · 2h · Hacker News

Three Models of RLHF Annotation: Extension, Evidence, and Authority
⚙️ MLOps · arxiv.org · 2d

Collective intelligence framework shows how human-AI teams may make better decisions
🤝 Human-AI Collaboration · techxplore.com · 9h

AI Infrastructure Architect · Builder · Author
🇨🇳 Chinese AI · markferraz.com · 10h · Hacker News

reward-lens: A Mechanistic Interpretability Library for Reward Models
🔍 AI Interpretability · arxiv.org · 1d

AI & Alignment
🛡️ AI Safety · chriscoyier.net · 5d · Hacker News

The Inference Economy: Token Use
💭 Reasoning Models · frontierai.substack.com · 11h · Substack

Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
🧠 Agent Memory · arxiv.org · 3d

Algorithm that gets ‘under the hood’ of AI models could effectively steer their responses
🔍 AI Interpretability · nature.com · 1d

The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
🎭 Claude · contralabs.com · 10h · Hacker News

The AI Flippening Is Here
🔎 AI Auditing · maximepeabody.substack.com · 2d · Substack

What Sentences Cause Alignment Faking?
🛡️ AI Safety · lesswrong.com · 3d

New Content From Current Directions in Psychological Science
🌋 Existential Risk Research · psychologicalscience.org · 14h

Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
🔢 BitNet · arxiv.org · 3d