Reward Function Design: a starter pack
lesswrong.com

Published on December 8, 2025 7:15 PM GMT

In the companion post “We need a field of Reward Function Design”, I implored researchers to think about which RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. I further suggested that human social instincts constitute an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (“ruthful”??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?

I don’t hav…
