How’s it going? Reinforcement learning in language models recruits a functional welfare axis

Covers 2 stories including Qwen3 Technical Report · Covered by lesswrong.com · Discussed on Hacker News

How does reinforcement learning shape a language model’s internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment ve...
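The excerpt above does not say how the concept vectors are extracted. As a rough illustration only, the sketch below uses difference-of-means, a common way to derive such a direction from activations, and then projects an unrelated activation onto it. The function names, the toy activations, and the choice of difference-of-means are all assumptions for illustration, not the paper's confirmed method.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, column by column."""
    return [fmean(col) for col in zip(*rows)]


def concept_vector(positive, negative):
    """Unit-length difference of means between two sets of activations.

    Assumption: difference-of-means stands in for whatever extraction
    method the paper actually uses.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    diff = [p - n for p, n in zip(pos_mean, neg_mean)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def project(activation, direction):
    """Scalar projection of one activation vector onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy (made-up) activations standing in for punished and rewarded trajectories.
punished = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rewarded = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

punishment_direction = concept_vector(punished, rewarded)

# Score an activation taken from a hypothetical setting unrelated to the maze.
score = project([0.5, 9.0, 9.0], punishment_direction)
```

In this toy setup the two sets differ only in the first coordinate, so the direction picks out that coordinate and the projection ignores the rest of the activation.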


Covered in 2 articles

lesswrong.com · Your Model Organisms Might Be Fried

lesswrong.com · How's it going? Reinforcement learning in language models recruits a functional welfare axis