Computer-Science Reinforcement Learning Got Rewards Wrong

In a recent blog post, Ben Recht described the Reinforcement Learning (RL) setup as:

Paraphrasing Thorndike’s Law of Effect, Lior defines reinforcement learning as the iterative process:

Receive external validation on how good you’re currently doing

Adjust what you’re currently doing so that you are better the next time around.

Whether or not this is how humans or animals learn, this is a spot-on definition of computer scientific reinforcement learning.

While this is not, in fact, how Lior defines RL, Ben is not wrong. This is how (most?) RL computer-science researchers think of RL.

But reading Ben’s text elucidated to me a point on which I think RL got things wrong. Those who know me know that I think the RL community got many things wrong. However, most of the things I think RL got wrong are at the implementation level, in how the abstract setup is translated into practice. This one, in contrast, is different. Here, I argue they got things wrong already at the formalization level. Fortunately, this particular mistake is one that, I think, is actually quite easy to fix. I think RL is wrong about how it thinks about rewards.

Here is Sutton and Barto’s classic RL textbook:

A reward signal defines the goal in a reinforcement learning problem. On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent. In a biological system, we might think of rewards as analogous to the experiences of pleasure or pain. They are the immediate and defining features of the problem faced by the agent. The reward sent to the agent at any time depends on the agent’s current action and the current state of the agent’s environment. The agent cannot alter the process that does this. The only way the agent can influence the reward signal is through its actions, which can have a direct effect on reward, or an indirect effect through changing the environment’s state.
  				-- Reinforcement Learning: an Introduction. Chapter 1.

In the most basic setup, an RL Agent observes the environment, acts, and obtains a reward. It then updates its policy etc. Do you see the issue here? The reward is modeled as part of the environment. The agent acts in the environment, and the environment provides back a reward. The reward is external to the agent: it is part of the world.

Here is Sutton and Barto again:

Rewards, too, presumably are computed inside the physical bodies of natural and artificial learning systems, but are considered external to the agent.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
  				-- Reinforcement Learning: an Introduction. Chapter 3.

I argue that this is a very weird, unnatural, and just plain wrong way to think about rewards. It is not how cognitive scientists think about rewards in RL, for example, and it is also where Ben’s description above departed from Lior’s original one [1]. This is just not how the world actually works.

When we act in the world, the world does not repay us with rewards; it only changes its state. It is we who interpret the state change and translate it into a reward. There is no reason RL agents should be any different. The reward mechanism should be part of the agent, not of the environment. The setup description should change to:

The agent acts in the environment.
The environment changes.
The agent observes the new environment.
The agent translates the observation into a reward.

In terms of implementation, this change is compatible with all current RL algorithms. We just need to consider the reward function as part of the agent and not part of the environment. Nothing profoundly changes. An easy change!
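As a minimal sketch of the setup above, here is a tabular Q-learning loop in which the environment's `step` returns only an observation and the agent computes the reward itself. The names `Corridor`, `QAgent`, `train`, and `reach_right_end` are hypothetical, invented for this illustration, and the corridor world is a toy assumption rather than anything from an existing library.

```python
import random
from collections import defaultdict


class Corridor:
    """Hypothetical toy world: cells 0..length-1. It reports state, never reward."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the world only changes its state.
        self.position = min(self.length - 1, max(0, self.position + action))
        return self.position  # an observation, not an (observation, reward) pair


class QAgent:
    """Tabular Q-learning agent that owns its reward function."""

    ACTIONS = (-1, +1)

    def __init__(self, reward_fn, alpha=0.5, gamma=0.9, seed=0):
        self.reward_fn = reward_fn  # the agent's own translation of observations
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)
        self.rng = random.Random(seed)

    def explore(self):
        # Uniformly random behaviour; Q-learning is off-policy, so this suffices here.
        return self.rng.choice(self.ACTIONS)

    def greedy(self, obs):
        return max(self.ACTIONS, key=lambda a: self.q[(obs, a)])

    def learn(self, obs, action, next_obs):
        reward = self.reward_fn(obs, action, next_obs)  # computed inside the agent
        best_next = max(self.q[(next_obs, a)] for a in self.ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])
        return reward


def train(agent, env, episodes=500, steps=20):
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(steps):
            action = agent.explore()            # the agent acts
            next_obs = env.step(action)         # the world changes; the agent observes
            agent.learn(obs, action, next_obs)  # the agent derives the reward
            obs = next_obs


def reach_right_end(obs, action, next_obs):
    # One possible goal (an assumption of this sketch): be in the last cell.
    return 1.0 if next_obs == 4 else 0.0


agent = QAgent(reward_fn=reach_right_end)
train(agent, Corridor(length=5))
print([agent.greedy(s) for s in range(5)])
```

The update rule is ordinary Q-learning; the only thing that moved is where `reward_fn` lives.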

But, if nothing really changes, why bother? Isn’t this just playing with notation? To some extent it is. But I think notation matters. It affects how we think of things. And the view in which the reward computation is part of the agent is just more accurate, which is its own reward (ha!).

Beyond just being more accurate, this updated setup in which reward computation is internal to the agent does also bring some benefits. For example, it conceptually allows different agents to learn in the same environment, while ending up with different policies, as each agent translates the same observations to different rewards. In other words, it allows an agent to have goals. The goal of the agent affects how it translates the observations about the environment into rewards, ending up with different policies. This is relevant already when learning to play games. Different learners can set different goals, such as "win the game as fast as possible" (speed running), "win the game while shooting as few monsters as possible" (pacifism challenge), or "explore as many screens as possible". Each of these is based on a different translation of the environment to rewards, and each will result in a different playing style. These changes in rewards are due to the agent and its goals, not due to the environment. The game is the same game, and it presents the agent with itself, not with a stream of rewards. The rewards are internal to the player.
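To make the game example concrete, here is a small sketch in which three goals score the same recorded playthroughs differently. Everything in it is hypothetical: the `Observation` fields, the reward magnitudes, and the three playthroughs are made up for illustration. The game only supplies observations; each goal is a reward function held by the player.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    screen: int         # which screen the player is on
    monsters_shot: int  # cumulative count shown by the game
    won: bool = False


def speedrun_reward(prev, obs):
    return 100.0 if obs.won else -1.0  # every non-winning step costs time


def pacifist_reward(prev, obs):
    shots = obs.monsters_shot - prev.monsters_shot
    return (100.0 if obs.won else 0.0) - 10.0 * shots


def make_explorer_reward():
    seen = set()  # the agent's own memory of visited screens

    def reward(prev, obs):
        seen.add(prev.screen)
        is_new = obs.screen not in seen
        seen.add(obs.screen)
        return 1.0 if is_new else 0.0

    return reward


# Each goal is a factory so that stateful rewards start fresh per playthrough.
GOALS = {
    "speedrun": lambda: speedrun_reward,
    "pacifist": lambda: pacifist_reward,
    "explorer": make_explorer_reward,
}

# Three made-up playthroughs of the same game, as observation sequences.
PLAYTHROUGHS = {
    "fast": [Observation(0, 0), Observation(1, 1), Observation(2, 2, True)],
    "peaceful": [Observation(0, 0), Observation(1, 0), Observation(1, 0),
                 Observation(1, 0), Observation(2, 0, True)],
    "wandering": [Observation(0, 0), Observation(1, 0), Observation(3, 0),
                  Observation(4, 1), Observation(5, 1), Observation(2, 1, True)],
}


def episode_return(make_reward, trajectory):
    reward_fn = make_reward()
    return sum(reward_fn(p, o) for p, o in zip(trajectory, trajectory[1:]))


preferred = {
    goal: max(PLAYTHROUGHS, key=lambda name: episode_return(factory, PLAYTHROUGHS[name]))
    for goal, factory in GOALS.items()
}
print(preferred)
```

Each goal ends up preferring a different playthrough, even though the observation sequences themselves never change.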

Going back to the second Sutton and Barto quote above, its second paragraph is wrong because the reward certainly can be changed arbitrarily by the agent. The environment cannot be changed. But the agent’s interpretation of the environment, including the translation of observations into rewards, most certainly can be changed by the agent. And it should be.

Once we start thinking in terms of the reward translation being part of the agent and not part of the environment, we can also start thinking of ways in which the agent can affect the reward translation mechanism. Maybe it can be dynamic and change across time. Maybe it can be tied to the policy. Maybe it can be learned. Many possible options, just from changing the formal setup a bit. Just by changing the notation. Because notation matters.
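As one hedged illustration of a reward translation that changes across time, here is a sketch of an internal reward that shrinks as an observation becomes familiar. The class name `NoveltyReward` and the inverse-square-root decay are assumptions chosen for this example, not something the argument depends on.

```python
import math
from collections import Counter


class NoveltyReward:
    """Internal reward that decays with the agent's own familiarity."""

    def __init__(self):
        self.visits = Counter()  # state kept by the agent, not by the environment

    def __call__(self, observation):
        self.visits[observation] += 1
        # The same observation is worth less each time the agent sees it again.
        return 1.0 / math.sqrt(self.visits[observation])


reward = NoveltyReward()
first = reward("screen-A")
second = reward("screen-A")
other = reward("screen-B")
print(first, second, other)
```

The environment presents the same observation both times; only the agent's internal state, and therefore its reward, has changed.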

If you believe in reinforcement learning, stop thinking of the reward as being part of the environment, and think of it as being part of the agent. It is better.

(Some may argue that by moving the reward computation into the learner, our learner becomes less general: we are overfitting the learner to a particular environment by forcing it to contain a heuristic for translating the environment observations into a reward signal. This is no longer a general-purpose agent, but an environment-specific one. True. But I argue that this is actually a benefit of the proposed setup, not a shortcoming. It highlights the fact that the reward heuristic is important, and that its existence makes RL algorithms less general. Hiding this heuristic computation in the environment does not make the overall setup more general; it just sweeps uncomfortable things under the rug.)

Footnotes

[1] Indeed, Lior shares the cognitive science view.
