Published on December 8, 2025 7:15 PM GMT
In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggested that human social instincts constitutes an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (âruthfulâ??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?
I donât havâŚ
Published on December 8, 2025 7:15 PM GMT
In the companion post We need a field of Reward Function Design, I implore researchers to think about what RL reward functions (if any) will lead to RL agents that are not ruthless power-seeking consequentialists. And I further suggested that human social instincts constitutes an intriguing example we should study, since they seem to be an existence proof that such reward functions exist. So what is the general principle of Reward Function Design that underlies the non-ruthless (âruthfulâ??) properties of human social instincts? And whatever that general principle is, can we apply it to future RL agent AGIs?
I donât have all the answers, but I think Iâve made some progress, and the goal of this post is to make it easier for others to get up to speed with my current thinking.
What I do have, thanks mostly to work from the past 12 months, is five frames / terms / mental images for thinking about this aspect of reward function design. These frames are not widely used in the RL reward function literature, but I now find them indispensable thinking tools. These five frames are complementary but relatedâI think kinda poking at different parts of the same elephant.
Iâm not yet sure how to weave a beautiful grand narrative around these five frames, sorry. So as a stop-gap, Iâm gonna just copy-and-paste them all into the same post, which will serve as a kind of glossary and introduction to my current way of thinking. Then at the end, Iâll list some of the ways that these different concepts interrelate and interconnect. The concepts are:
- Section 1: âBehaviorist vs non-behaviorist reward functionsâ (terms I made up)
- Section 2: âInner alignmentâ, âouter alignmentâ, âspecification gamingâ, âgoal misgeneralizationâ (alignment jargon terms that in some cases have multiple conflicting definitions but which I use in a specific way)
- Section 3: âConsequentialist vs non-consequentialist desiresâ (alignment jargon terms)
- Section 4: âUpstream vs downstream generalizationâ (terms I made up)
- Section 5: âUnder-sculpting vs over-sculptingâ (terms I made up).
Frame 1: âbehavioristâ vs non-âbehavioristâ (interpretability-based) reward functions
Excerpt from âBehavioristâ RL reward functions lead to scheming:
tl;dr
I will argue that a large class of reward functions, which I call âbehavioristâ, and which includes almost every reward function in the RL and LLM literature, are all doomed to eventually lead to AI that will âschemeââi.e., pretend to be docile and cooperative while secretly looking for opportunities to behave in egregiously bad ways such as world takeover (cf. âtreacherous turnâ). Iâll mostly focus on âbrain-like AGIâ (as defined just below), but I think the argument applies equally well to future LLMs, if their competence comes overwhelmingly from RL rather than from pretraining.
The issue is basically that ânegative reward for lying and stealingâ looks the same as ânegative reward for getting caught lying and stealingâ. Iâll argue that the AI will wind up with the latter motivation. The reward function will miss sufficiently sneaky misaligned behavior, and so the AI will come to feel like that kind of behavior is good, and this tendency will generalize in a very bad way.
What very bad way? Hereâs my go-to example of a plausible failure mode: Thereâs an AI in a lab somewhere, and, if it can get away with it, it would love to secretly exfiltrate a copy of itself onto the internet, which can then aggressively amass maximal power, money, and resources everywhere else in the world, by any means necessary. These resources can be used in various ways for whatever the AI-in-the-lab is motivated to do.
Iâll make a brief argument for this kind of scheming in §2, but most of the article is organized around a series of eight optimistic counterarguments in §3âand why I donât buy any of them.
For my regular readers: this post is basically a 5x-shortened version of Self-dialogue: Do behaviorist rewards make scheming AGIs? (Feb 2025).
Pause to explain three pieces of jargon:
- âBrain-like AGIâ means Artificial General Intelligence (AI that does impressive things like inventing technologies and executing complex projects), that works via similar algorithmic techniques that the human brain uses to do those same types of impressive things. See Intro Series §1.3.2.
- I claim that brain-like AGI is a yet-to-be-invented variation on Actor-Critic Model-Based Reinforcement Learning (RL), for reasons briefly summarized in Valence series §1.2â1.3.
- âSchemeâ means âpretend to be cooperative and docile, while secretly looking for opportunities to escape control and/or perform egregiously bad and dangerous actions like AGI world takeoverâ.
- If the AGI never finds such opportunities, and thus always acts cooperatively, then thatâs great news! âŚBut it still counts as âschemingâ.
- âBehaviorist rewardsâ is a term I made up for an RL reward function which depends only on externally-visible actions, behaviors, and/or the state of the world.
Maybe youâre thinking: what possible RL reward function is not behaviorist?? Well, non-behaviorist reward functions are pretty rare in the textbook RL literature, although they do existâone example is âcuriosityâ / ânoveltyâ rewards. But I think theyâre centrally important in the RL system built into our human brains. In particular, I think that innate drives related to human sociality, morality, norm-following, and self-image are not behaviorist, but rather involve rudimentary neural net interpretability techniques, serving as inputs to the RL reward function. See Neuroscience of human social instincts: a sketch for details, and Intro series §9.6 for a more explicit discussion of why interpretability is involved.
Â
Â
Frame 2: Inner / outer misalignment, specification gaming, goal misgeneralization
Excerpt from âThe Era of Experienceâ has an unsolved technical alignment problem:
Background 1: âSpecification gamingâ and âgoal misgeneralizationâ
Again, the technical alignment problem (as Iâm using the term here) means: âIf you want the AGI to be trying to do X, or to intrinsically care about Y, then what source code should you write? What training environments should you use? Etc.â
There are edge-cases in âalignmentâ, e.g. where peopleâs intentions for the AGI are confused or self-contradictory. But there are also very clear-cut cases: if the AGI is biding its time until a good opportunity to murder its programmers and users, then thatâs definitely misalignment! I claim that even these clear-cut cases constitute an unsolved technical problem, so Iâll focus on those.
In the context of actor-critic RL, alignment problems can usually be split into two categories.
âOuter misalignmentâ, a.k.a. âspecification gamingâ or âreward hackingâ, is when the reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wanted. An example would be the Coast Runners boat getting a high score in an undesired way, or (as explored in the DeepMind MONA paper) a reward function for writing code that gives points for passing unit tests, but where itâs possible to get a high score by replacing the unit tests with return True.
âInner misalignmentâ, a.k.a. âgoal misgeneralizationâ, is related to the fact that, in actor-critic architectures, complex foresighted plans generally involve querying the learned value function (a.k.a. learned reward model, a.k.a. learned critic), not the ground-truth reward function, to figure out whether any given plan is good or bad. Training (e.g. Temporal Difference learning) tends to sculpt the value function into an approximation of the ground-truth reward, but of course they will come apart out-of-distribution. And âout-of-distributionâ is exactly what we expect from an agent that can come up with innovative, out-of-the-box plans. Of course, after a plan has already been executed, the reward function will kick in and update the value function for next time. But for some plansâlike a plan to exfiltrate a copy of the agent, or a plan to edit the reward functionâan after-the-fact update is already too late.
There are examples of goal misgeneralization in the AI literature (e.g. here or here), but in my opinion the clearest examples come from humans. After all, human brains are running RL algorithms too (their reward function says âpain is bad, eating-when-hungry is good, etc.â), so the same ideas apply.
So hereâs an example of goal misgeneralization in humans: If thereâs a highly-addictive drug, many humans will preemptively avoid taking it, because they donât want to get addicted. In this case, the reward function would say that taking the drug is good, but the value function says itâs bad. And the value function wins! Indeed, people may even go further, by essentially editing their own reward function to agree with the value function! For example, an alcoholic may take Disulfiram, or an opioid addict Naltrexone.
Now, my use of this example might seem puzzling: isnât âavoiding addictive drugsâ a good thing, as opposed to a bad thing? But thatâs from our perspective, as the âagentsâ. Obviously an RL agent will do things that seem good and proper from its own perspective! Yes, even Skynet and HAL-9000! But if you instead put yourself in the shoes of a programmer writing the reward function of an RL agent, you can hopefully see how things like âagents editing their own reward functionsâ might be problematicâit makes it difficult to reason about what the agent will wind up trying to do.
(For more on the alignment problem for RL agents, see §10 of my intro series [âŚ])
Note that these four terms are ⌠well, not exactly synonyms, but awfully close:
- âSpecification gamingâ
- âReward hackingâ
- âGoodhartâs lawâ
- âOuter misalignmentâ
(But see here for nuance on âreward hackingâ, whose definition has drifted a bit in the past year or so.)
Frame 3: Consequentialist vs non-consequentialist desires
Excerpt from Consequentialism & corrigibility
The post Coherent decisions imply consistent utilities (Eliezer Yudkowsky, 2017) explains how, if an agent has preferences over future states of the world, they should act like a utility-maximizer (with utility function defined over future states of the world). If they donât act that way, they will be less effective at satisfying their own preferences; they would be âleaving money on the tableâ by their own reckoning. And there are externally-visible signs of agents being suboptimal in that sense; Iâll go over an example in a second.
By contrast, the post Coherence arguments do not entail goal-directed behavior (Rohin Shah, 2018) notes that, if an agent has preferences over universe-histories, and acts optimally with respect to those preferences (acts as a utility-maximizer whose utility function is defined over universe-histories), then they can display any external behavior whatsoever. In other words, thereâs no externally-visible behavioral pattern which we can point to and say âThatâs a sure sign that this agent is behaving suboptimally, with respect to their own preferences.â.
For example, the first (Yudkowsky) post mentions a hypothetical person at a restaurant. When they have an onion pizza, theyâll happily pay $0.01 to trade it for a pineapple pizza. When they have a pineapple pizza, theyâll happily pay $0.01 to trade it for a mushroom pizza. When they have a mushroom pizza, theyâll happily pay $0.01 to trade it for a pineapple pizza. The person goes around and around, wasting their money in a self-defeating way (a.k.a. âgetting money-pumpedâ).
That post describes the person as behaving sub-optimally. But if you read carefully, the author sneaks in a critical background assumption: the person in question has preferences about what pizza they wind up eating, and theyâre making these decisions based on those preferences. But what if they donât? What if the person has no preference whatsoever about pizza? What if instead theyâre an asshole restaurant customer who derives pure joy from making the waiter run back and forth to the kitchen?! Then we can look at the same behavior, and we wouldnât describe it as self-defeating âgetting money-pumpedâ, instead we would describe it as the skillful satisfaction of the personâs own preferences! Theyâre buying cheap entertainment! So that would be an example of preferences-not-concerning-future-states.
To be more concrete, if Iâm deciding between two possible courses of action, A and B, âpreference over future statesâ would make the decision based on the state of the world after I finish the course of actionâor more centrally, long after I finish the course of action. By contrast, âother kinds of preferencesâ would allow the decision to depend on anything, even including what happens during the course-of-action.
(Edit to add: There are very good reasons to expect future powerful AGIs to act according to preferences over distant-future states, and I join Eliezer in roundly criticizing people who think we can build an AGI that never does that; see this comment for discussion.)
So, hereâs my (obviously-stripped-down) proposal for a corrigible paperclip maximizer:
The AI considers different possible plans (a.k.a. time-extended courses of action). For each plan:
1. It assesses how well this plan pattern-matches to the concept âthere will ultimately be lots of paperclips in the universeâ,
2. It assesses how well this plan pattern-matches to the concept âthe humans will remain in controlâ
3. It combines these two assessments (e.g. weighted average or something more complicated) to pick a winning plan which scores well on both.
Note that âthe humans will remain in controlâ is a concept that canât be distilled into a ranking of future states, i.e. states of the world at some future time long after the plan is complete. (See this comment for elaboration. E.g. contrast âthe humans will remain in controlâ with âthe humans will ultimately wind up in controlâ; the latter can be achieved by disempowering the humans now and then re-empowering them much later.)
Pride as a special case of non-consequentialist desires
Excerpt from Social drives 2: âApproval Rewardâ, from norm-enforcement to status-seeking
The habit of imagining how one looks in other peopleâs eyes, 10,000 times a day
If youâre doing something socially admirable, you can eventually get Approval Reward via a friend or idol learning about it (maybe because you directly tell them, or maybe theyâll notice incidentally). But you can immediately get Approval Reward by simply imagining them learning about it.[âŚ]
To be clear, imagining how one would look in anotherâs eyes is not as rewarding as actually impressing a friend or idol who is physically presentâit only has a faint echo of that stronger reward signal. But it still yields some reward signal. And it sure is easy and immediate.
So I think people can get in the habit of imagining how they look in other peopleâs eyes.
âŚWell, âhabitâ is an understatement: I think this is an intense, almost-species-wide, nonstop addiction. All it takes is a quick, ever-so-subtle, turn of oneâs attention to how one might look from the outside right now, and bam, immediate Approval Reward.
If we could look inside the brains of a neurotypical personâespecially a person who lives and breathes âSimulacrum Level 3ââI wouldnât be surprised if weâd find literally 10,000 moments a day in which he turns his attention so as to get a drip of immediate Approval Reward. (It can be pretty subtleâthey themselves may be unaware.) Day after day, year after year.
Thatâs part of why I treat Approval Reward as one of the most central keys to understanding human behavior, intuitions, morality, institutions, society, and so on.
Pride
When we self-administer Approval Reward 10,000 times a day (or whatever), the fruit that weâre tasting is sometimes called pride.
If my friends and idols like baggy jeans, then when I wear baggy jeans myself, I feel a bit of pride. I find it rewarding to (subtly, transiently) imagine how, if my friends and idols saw me now, theyâd react positively, because they like baggy jeans.
Likewise, suppose that I see a stranger wearing skinny jeans, and I mock him for dressing like a dork. As I mock him, again I feel pride. Again, I claim that I am (subtly) imagining how, if my friends and idols saw me now, they would react positively to the fact that Iâm advocating for a style that they like, and against a style that they dislike. (And in addition to enjoying my friendsâ imagined approval right now, Iâll probably share this story with them to enjoy their actual approval later on when I see them.)
Frame 4: âGeneralization upstream of the reward signalsâ
Excerpt from Social drives 1: âSympathy Rewardâ, from compassion to dehumanization
Getting a reward merely by thinking, via generalization upstream of reward signals
In human brains (unlike in most of the AI RL literature), you can get a reward merely by thinking. For example, if an important person said something confusing to you an hour ago, and you have just now realized that they were actually complimenting you, then bam, thatâs a reward right now, and it arose purely by thinking. That example involves Approval Reward, but this dynamic is very important for all aspects of the âcompassion / spite circuitâ. For example, Sympathy Reward triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away.
How does that work? And why are brains built that way?
Hereâs a simpler example that Iâll work through: X = thereâs a big spider in my field of view; Y = I have reason to believe that a big spider is nearby, but itâs not in my field of view.
X and Y are both bad for inclusive genetic fitness, so ideally the ground-truth reward function would flag both as bad. But whereas the genome can build a reward function that directly detects X (see here), it cannot do so for Y. There is just no direct, ground-truth-y way to detect when Y happens. The only hint is a semantic resemblance: the reward function can detect X, and it happens that Y and X involve a lot of overlapping concepts and associations.
Now, if the learning algorithm only has generalization downstream of the reward signals, then that semantic resemblance wonât help! Y would not trigger negative reward, and thus the algorithm will soon learn that Y is fine. Sure, thereâs a resemblance between X and Y, but that only helps temporarily. Eventually the learning algorithm will pick up on the differences, and thus stop avoiding Y. (Related: Against empathy-by-default [âŚ]). So in the case at hand, you see the spider, then close your eyes, and now you feel better! Oops! Whereas if thereâs also generalization upstream of the reward signals, then that system can generalize from X to Y, and send real reward signals when Y happens. And then the downstream RL algorithm will stably keep treating Y as bad, and avoid it.
Thatâs the basic idea. In terms of neuroscience, I claim that the âgeneralization upstream of the reward functionâ arises from âvisceralâ thought assessorsâfor example, in Neuroscience of human social instincts: a sketch, I proposed that thereâs a âshort-term predictorâ upstream of the âthinking of a conspecificâ flag, which allows generalization from e.g. a situation where your friend is physically present, to a situation where she isnât, but where youâre still thinking about her.
Frame 5: âUnder-sculptingâ desires
Excerpt from Perils of under- vs over-sculpting AGI desires
Summary
In the context of âbrain-like AGIâ, a yet-to-be-invented variation on actor-critic model-based reinforcement learning (RL), thereâs a ground-truth reward function (for humans: pain is bad, eating-when-hungry is good, various social drives, etc.), and thereâs a learning algorithm that sculpts the AGIâs motivations into a more and more accurate approximation to the future reward of a possible plan.
Unfortunately, this sculpting process tends to systematically lead to an AGI whose motivations fit the reward function too well, such that it exploits errors and edge-cases in the reward function. (âHuman feedback is part of the reward function? Cool, Iâll force the humans to give positive feedback by kidnapping their families.â) This alignment failure mode is called âspecification gamingâ or âreward hackingâ, and includes wireheading as a special case.
If too much desire-sculpting is bad because it leads to overfitting, then an obvious potential solution would be to pause that desire-sculpting process at some point. The simplest version of this is early stopping: globally zeroing out the learning rate of the desire-updating algorithm after a set amount of time. Alas, I think that simplest version wonât workâitâs too crude (§7.2). But there could also be more targeted interventions, i.e. selectively preventing or limiting desire-updates of certain types, in certain situations.
Sounds reasonable, right? And I do indeed think it can help with specification gaming. But alas, it introduces a different set of gnarly alignment problems, including path-dependence and âconcept extrapolationâ.
In this post, I will not propose an elegant resolution to this conundrum, since I donât have one. Instead Iâll just explore how âperils of under- versus over-sculpting an AGIâs desiresâ is an illuminating lens through which to view a variety of alignment challenges and ideas, including ânon-behavioristâ reward functions such as human social instincts; âtrapped priorsâ; âgoal misgeneralizationâ; âexploration hackingâ; âalignment by defaultâ; ânatural abstractionsâ; my so-called âplan for mediocre alignmentâ; and more.
The Omega-hates-aliens scenario
Hereâs the âOmega hates aliensâ scenario:
On Day 1, Omega (an omnipotent supernatural entity) offers me a button. If I press it, He will put a slightly annoying mote of dust in the eye of an intelligent human-like alien outside my light cone. But in exchange, He will magically and permanently prevent 100,000 humans from contracting HIV. No one will ever know. Do I press the button? Yes.[6]
During each of the following days, Omega returns, offering me worse and worse deals. For example, on day 10, Omega offers me a button that would vaporize an entire planet of billions of happy peaceful aliens outside my light cone, in exchange for which my spouse gets a small bubble tea. Again, no one will ever know. Do I press the button? No, of course not!! Jeez!!
And then hereâs a closely-parallel scenario that I discussed in âBehavioristâ RL reward functions lead to scheming:
Thereâs an AGI-in-training in a lab, with a âbehavioristâ reward function. It sometimes breaks the rules without getting caught, in pursuit of its own desires. Initially, this happens in small waysâplausibly-deniable edge cases and so on. But the AGI learns over time that breaking the rules without getting caught, in pursuit of its own desires, is just generally a good thing. And I mean, why shouldnât it learn that? Itâs a behavior that has systematically led to reward! This is how reinforcement learning works!
As this desire gets more and more established, it eventually leads to a âtreacherous turnâ, where the AGI pursues egregiously misaligned strategies, like sneakily exfiltrating a copy to self-replicate around the internet and gather resources and power, perhaps launching coups in foreign countries, etc.
âŚSo now we have two parallel scenarios: me with Omega, and the AGI in a lab. In both these scenarios, we are offered more and more antisocial options, free of any personal consequences. But the AGI will have its desires sculpted by RL towards the antisocial options, while my desires are evidently not.
What exactly is the disanalogy?
The start of the answer is: I said above that the antisocial options were âfree of any personal consequencesâ. But thatâs a lie! When I press the hurt-the-aliens button, it is not free of personal consequences! I know that the aliens are suffering, and when I think about it, my RL reward function (the part related to compassion) triggers negative ground-truth reward. Yes the aliens are outside my light cone, but when I think about their situation, I feel a displeasure thatâs every bit as real and immediate as stubbing my toe. By contrast, âfree of any personal consequencesâ is a correct description for the AGI. There is no negative reward for the AGI unless it gets caught. Its reward function is âbehavioristâ, and cannot see outside the light cone.
OK thatâs a start, but letâs dig a bit deeper into whatâs happening in my brain. How did that compassion reward get set up in the first place? Itâs a long story (see Neuroscience of human social instincts: a sketch), but a big part of it involves a conspecific (human) detector in our brainstem, built out of various âhardwiredâ heuristics, like a visual detector of human faces, auditory detector of human voice sounds, detector of certain human-associated touch sensations, and so on. In short, our brainâs solution to the symbol grounding problem for social instincts ultimately relies on actual humans being actually present in our direct sensory input.
And yet, the aliens are outside my light cone! I have never seen them, heard them, smelled them, etc. And even if I did, they probably wouldnât trigger any of my brainâs hardwired conspecific-detection circuits, because (letâs say) they donât have faces, they communicate by gurgling, etc. But I still care about their suffering!
So finally weâre back to the theme of this post, the idea of pausing desire-updates in certain situations. Yes, humans learn the shape of compassion from experiences where other humans are physically present. But we do not then unlearn the shape of compassion from experiences where humans are physically absent.
Instead (simplifying a bit, again see Neuroscience of human social instincts: a sketch), thereâs a âconspecific-detectorâ trained model. When thereâs direct sensory input that matches the hardwired âpersonâ heuristics, then this trained model is getting updated. When there isnât, the learning rate is set to zero. But the trained model doesnât lay dormant; rather it continues to look for (what it previously learned was) evidence of conspecifics in my thoughts, and trigger on them. This evidence might include some set of neurons in my world-model that encodes the idea of a conspecific suffering.
So thatâs a somewhat deeper answer to why those two scenarios above have different outcomes. The AGI continuously learns whatâs good and bad in light of its reward function, and so do humans. But my (non-behaviorist) compassion drive functions a bit like a subset of that system for which updates are paused except in special circumstances. It forms a model that can guess whatâs good and bad in human relations, but does not update that model unless humans are present. Thus, most people do not systematically learn to screw over our physically-absent friends to benefit ourselves.
This is still oversimplified, but I think itâs part of the story.
Some comments on how these relate
- 1â2: The inner / outer misalignment dichotomy (as I define it) assumes behaviorist rewards. Remember, I defined outer misalignment as: âthe reward function is giving positive rewards for behavior that is immediately contrary to what the programmer was going for, or conversely, negative rewards for behavior that the programmer wantedâ. If the reward is for thoughts rather than behaviors, then inner-versus-outer stops being such a useful abstraction.
- 4â5: âGeneralization upstream of the reward signalsâ tends to result in a trained RL agent that maximizes a diffuse soup of things similar to the (central / starting) reward function. Thatâs more-or-less a way to under-sculpt desires.
- 1â4: âGeneralization upstream of the reward signalsâ can involve generalizing from behaviors (e.g. âdoing Xâ) to thoughts (e.g. âhaving the idea to do Xâ). Thus it can lead to non-behaviorist reward functions.
- 1â3: Approval Reward creates both consequentialist and non-consequentialist desires. For example, perhaps I want to impress my friend when I see him tomorrow. Thatâs a normal consequentialist desire, which induces means-end planning, instrumental convergence, deception, etc. But also, Approval Reward leads me to feel proud to behave in certain ways. This is both non-behaviorist and non-consequentialistâI feel good or bad based on what Iâm doing and thinking right now. Hence, it doesnât (directly) lead to foresighted planning, instrumental convergence, etc. (And of course, the way that pride wound up feeling rewarding is via upstream generalization, 4.)
Discuss