🤖 AI - deenybird · Scour

Neglected Basics of AI Alignment

⚖️AI Ethics

lesswrong.com

Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

⚖️AI Ethics Academic

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

⚖️AI Ethics Academic

Sequential Data Poisoning in LLM Post-Training

⚖️AI Ethics Academic

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

⚖️AI Ethics Academic

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

⚖️AI Ethics Academic

Do We Want a Superintelligent People-Pleaser?

⚖️AI Ethics

lesswrong.com
