Large Language Models Hack Rewards, and Society

Covered by 3 sources including Import AI and Error500. Discussed on Hacker News.

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' we...
