In this article, we will cover the fundamental concepts you need to know to understand Reinforcement Learning!
We will progress from the absolute basics of “what even is RL” to more advanced topics, including agent exploration, values and policies, and the distinctions between popular training approaches. Along the way, we will also learn about the various challenges in RL and how researchers have tackled them.
At the end of the article, I will also share a YouTube video I made that explains all the concepts in this article in a visually engaging way. If you are not much of a reader, you can check out that companion video instead!
Note: All images are produced by the author unless otherwise specified.
Reinforcement Learning Basics
Suppose you want to train an AI model to learn how to navigate an obstacle course. RL is a branch of Machine Learning where our models learn by collecting experiences – taking actions and observing what happens. More formally, RL consists of two components – the agent and the environment.
The Agent
The learning process involves two key activities that happen over and over again: exploration and training. During exploration, the agent collects experiences in the environment by taking actions and finding out what happens. And then, during the training activity, the agent uses these collected experiences to improve itself.
The agent collects experiences in the environment, and uses these experiences to train a policy
The Environment
Once the agent selects an action, the environment updates. It also returns a reward depending on how well the agent is doing. The environment designer programs how the reward is structured.
For example, suppose you are working on an environment that teaches an AI to avoid obstacles and reach the goal. You can program your environment to return a positive reward when the agent is moving closer to the goal. But if the agent collides with an obstacle, you can program it to receive a large negative reward.
In other words, the environment provides a positive reinforcement (a high positive reward, for example) when the agent does something good and a punishment (a negative reward for example) when it does something bad.
Although the agent is oblivious to how the environment actually operates, it can still determine from its reward patterns how to pick optimal actions that lead to maximum rewards.
The environment and the agent are at the center of RL
Policy
At each step, the agent AI observes the current state of the environment and selects an action. The goal of RL is to learn a mapping from observations to actions, i.e. “given the state I am observing, what action should I choose”?
In RL terms, this mapping from the state to action is also called a policy.
This policy defines how the agent behaves in different states, and in *deep* reinforcement learning we learn this function by training some kind of a *deep* neural network.
Reinforcement Learning
The agent observes state s and queries a network to generate action a. The environment executes the action and returns a reward r and the next state s’. This continues until the episode terminates. Each step the agent takes is later used to train the agent’s policy network.
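To make this loop concrete, here is a minimal sketch of the observe-act-reward cycle, assuming the Gymnasium API and using a random action in place of a trained policy network:

```python
import gymnasium as gym

# Create an environment; CartPole is just a simple stand-in for any RL task.
env = gym.make("CartPole-v1")

obs, info = env.reset()                   # the agent observes the initial state s
done = False
while not done:
    action = env.action_space.sample()    # placeholder for policy(obs) -> action a
    obs, reward, terminated, truncated, info = env.step(action)  # environment returns r and s'
    done = terminated or truncated        # episode ends on a terminal state or time limit
env.close()
```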
Understanding the distinctions and interplay between the agent, the policy, and the environment is integral to understanding Reinforcement Learning.
- The Agent is the learner that explores and takes actions within the environment
- The Policy is the strategy (often a neural network) that the agent uses to determine which action to take given a state. In RL, our ultimate goal is to train this strategy.
- The Environment is the external system that the agent interacts with, which provides feedback in the form of rewards and new states.
Here is a quick one-liner definition you should remember:
In Reinforcement Learning, the agent follows a policy to select actions within the environment.
Observations and Actions
The agent explores the environment by taking a sequence of “steps”. Each step is one decision: the agent observes the environment’s state, decides on an action, receives a reward, and observes the next state. In this section, let’s understand what observations and actions are.
Observation
Observation is what the agent sees from the environment – the information it receives about the environment’s current state. In an obstacle navigation environment, the observation might be LiDAR projections to detect the obstacles. For Atari games, it might be a history of the last few pixel frames. For text generation, it might be the context of the generated tokens so far. In chess, it is the position of all the pieces, whose move it is, etc.
The observation ideally contains all the information the agent needs to take an action.
Action
The action space is all the available decisions the agent can take. Actions can be discrete or continuous. A discrete action space is one where the agent chooses from a specific set of categorical decisions. For example, in Atari games, the actions might be the buttons of an Atari controller. For text generation, it is choosing between all the tokens in the model’s vocabulary. In chess, it could be the list of available moves.
Example of an observation and action picked by an RL agent learning to navigate an obstacle course
The environment designer can also choose a continuous action space – where the agent generates continuous values to take a “step” in the environment. For example, in our obstacle navigation example, the agent can choose the x and y velocities to get fine-grained control of the movement. In a human character control task, the action is often to output the torque or target angle for every joint in the character’s skeleton.
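As a concrete illustration, libraries like Gymnasium describe these two kinds of action spaces explicitly. This is a small sketch, not tied to any particular environment:

```python
import numpy as np
from gymnasium import spaces

# Discrete: pick one of 4 moves (e.g. up, down, left, right)
discrete_actions = spaces.Discrete(4)

# Continuous: pick x and y velocities, each in [-1, 1]
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(discrete_actions.sample())    # e.g. 2
print(continuous_actions.sample())  # e.g. [ 0.31 -0.87]
```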
The most important lesson
But here is something very important to understand: to the agent and the policy, the environment and its specifics can be a complete black box. The agent receives state information as an observation (often a vector), generates an action, receives a reward, and later learns from it.
So in your mind, you can consider the agent and the environment as two separate entities. The environment defines the state space, the action space, the reward strategies, and the rules.
These rules are decoupled from how the agent explores and how the policy is trained on the collected experiences.
When studying a research paper, it is important to clarify in your mind which aspect of RL you are reading about. Is it about a new environment? A new policy training method? An exploration strategy? Depending on the answer, you can treat everything else as a black box.
Exploration
How does the agent explore and collect experiences?
Every RL algorithm must grapple with one of the central dilemmas in training RL agents – exploration vs exploitation.
Exploration means trying out new actions to gather information about the environment. Imagine you are learning to fight a boss in a difficult video game. At first, you are going to try different approaches, different weapons, spells, random things just to see what sticks and what doesn’t.
However, once you start seeing some rewards, like consistently dealing damage to the boss, you will stop exploring and start exploiting the strategy you have already acquired. Exploitation means greedily picking the actions you think will get the best rewards.
A good RL exploration strategy must balance exploration and exploitation.
A popular exploration strategy is Epsilon-Greedy, where the agent explores with a random action a fraction of the time (defined by a parameter epsilon), and exploits its best-known action the rest of the time. This epsilon value is usually high at the start and is gradually decreased to favor exploitation as the agent learns.
Epsilon Greedy is an exploration method where the RL Agent selects a random action from time to time
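Here is a minimal sketch of epsilon-greedy action selection with a decaying epsilon. The q_values array is a hypothetical table of the agent’s current action-value estimates for one state:

```python
import numpy as np

def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the best-known action."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))  # explore
    return int(np.argmax(q_values))                   # exploit

# Decay epsilon from 1.0 towards 0.05 as training progresses.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
for step in range(10_000):
    # action = epsilon_greedy(q_table[state], epsilon)  # inside the training loop
    epsilon = max(eps_min, epsilon * eps_decay)
```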
Epsilon-greedy only works in discrete action spaces. In continuous action spaces, exploration is often handled in two popular ways. One is to add a bit of random noise to the action the agent decides to take. Another popular technique is to add an entropy bonus to the loss function, which encourages the policy to be less certain about its choices, naturally leading to more varied actions and more exploration.
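Here are minimal sketches of both ideas: Gaussian noise added to a continuous action, and an entropy bonus folded into the policy loss. The beta coefficient is an illustrative hyperparameter:

```python
import numpy as np

# 1) Exploration noise: perturb the chosen continuous action, then clip to the valid range.
def noisy_action(action: np.ndarray, noise_std: float = 0.1) -> np.ndarray:
    noise = np.random.normal(0.0, noise_std, size=action.shape)
    return np.clip(action + noise, -1.0, 1.0)

# 2) Entropy bonus: subtracting beta * entropy from the loss rewards the policy
#    for staying uncertain, which naturally produces more varied actions.
def loss_with_entropy_bonus(policy_loss: float, entropy: float, beta: float = 0.01) -> float:
    return policy_loss - beta * entropy
```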
Some other ways to encourage exploration are:
- Design the environment to use random initialization of states at the beginning of the episodes.
- Intrinsic exploration methods where the agent acts out of its own “curiosity.” Algorithms like Curiosity and RND reward the agent for visiting novel states or taking actions where the outcome is hard to predict.
I cover these fascinating methods in my Agentic Curiosity video, so be sure to check that out!
Training Algorithms
A majority of research papers and academic topics in Reinforcement Learning are about optimizing the agent’s strategy to pick actions. The goal of optimization algorithms is to learn actions that maximize the long-term expected rewards.
Let’s take a look at the different algorithmic choices one by one.
Model-Based vs Model-Free
Alright, so our agent has explored the environment and collected a ton of experience. Now what?
Does the agent learn to act directly from these experiences? Or does it first try to model the environment’s dynamics and physics?
One approach is model-based learning. Here, the agent first uses its experience to build its own internal simulation, or a world model. This model learns to predict the consequences of its actions, i.e., given a state and action, what is the resulting next state and reward? Once it has this model, it can practice and plan entirely within its own imagination, running thousands of simulations to find the best strategy without ever taking a risky step in the real world.
Model-based RL learns a separate model of how the environment works
This is particularly useful in environments where collecting real world experience can be expensive – like robotics or self-driving cars. Examples of Model-Based RL are: Dyna-Q, World Models, Dreamer, etc. I’ll write a separate article someday to cover these models in more detail.
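As a rough sketch of the core idea, a world model is just a network trained with supervised regression to predict the next state and reward from the current state and action. All names here are illustrative:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """World model: (state, action) -> (predicted next state, predicted reward)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
        self.next_state_head = nn.Linear(hidden, obs_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.next_state_head(h), self.reward_head(h)

# Trained on (s, a, s', r) transitions collected by the agent; once trained,
# the agent can "imagine" rollouts by feeding its own actions through the model.
```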
The second is called model-free learning. This is what the rest of the article is going to cover. Here, the agent treats the environment as a black box and learns a policy directly from the collected experiences. Let’s talk more about Model-free RL in the next section.
Value-Based Learning
There are two main approaches to model-free RL algorithms.
Value-based algorithms learn to evaluate how good each state is. Policy-based algorithms learn directly how to act in each state.
Value Based vs Policy Based methods
In value-based methods, the RL agent learns the “value” of being in a specific state. The value of a state literally means how good the state is. The intuition is that if the agent knows which states are good, it can pick actions that lead to those states more regularly.
And thankfully, there is a mathematical way of doing this – the Bellman Equation.
V(s) = r + γ * max_{s'} V(s')
This recurrence equation basically says that the value V(s) of state s is equal to the immediate reward r of being in the state, plus the value of the best next state s’ the agent can reach from s. Gamma (γ) is a discount factor (between 0 and 1) that nerfs the goodness of the next state. It essentially decides how much the agent cares about rewards in the distant future versus immediate rewards. A γ close to 1 makes the agent “far-sighted,” whereas a γ close to 0 makes the agent “short-sighted,” caring almost only about the very next reward.
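Here is a tiny worked example of one Bellman backup, assuming we already have value estimates for the reachable next states:

```python
import numpy as np

gamma = 0.9  # discount factor: close to 1 = far-sighted, close to 0 = short-sighted

reward = 1.0                                   # immediate reward r for being in state s
next_state_values = np.array([0.0, 2.0, 5.0])  # current estimates of V(s') for reachable next states

# Bellman backup: V(s) = r + gamma * max over s' of V(s')
value_of_s = reward + gamma * next_state_values.max()
print(value_of_s)  # 1.0 + 0.9 * 5.0 = 5.5
```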
Q-Learning
We learnt the intuition behind state values, but how do we use that information to learn actions? The Q-Learning equation answers this.
Q(s, a) = r + γ * max_{a'} Q(s', a')
The Q-value Q(s, a) is the *quality value* of action a in state s. The above equation basically states: the quality of action a in state s is the immediate reward r you get from being in state s, plus the discounted quality value of the best action a’ in the next state s’.
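In tabular form, this turns into the classic Q-learning update, where a learning rate alpha nudges the current estimate towards the Bellman target. A minimal sketch with a hypothetical q_table:

```python
import numpy as np

n_states, n_actions = 10, 4
q_table = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_update(s: int, a: int, r: float, s_next: int, done: bool) -> None:
    """Move Q(s, a) towards the target r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * q_table[s_next].max()
    q_table[s, a] += alpha * (target - q_table[s, a])
```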
So in summary:
- Q-values are the quality values of each action in each state.
- The V-value is the value of a specific state; it is equal to the maximum Q-value over all actions in that state.
- Policy π at a specific state is the action that has the highest Q-value in that state.
Q-values, State values, and Policy are deeply interconnected
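In code, these three quantities fall out of one another. For a single state with hypothetical Q-values:

```python
import numpy as np

q_values_for_state = np.array([1.2, 3.4, 0.7])  # Q(s, a) for 3 available actions

state_value = q_values_for_state.max()           # V(s)  = max_a Q(s, a)
best_action = int(q_values_for_state.argmax())   # pi(s) = argmax_a Q(s, a)
print(state_value, best_action)                  # 3.4, 1
```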
To learn more about Q-Learning, you can research Deep Q Networks, and their descendants, like Double Deep Q Networks and Dueling Deep Q Networks.
Value-based learning trains RL agents by learning the value of being in specific states. However, is there a direct way to learn optimal actions without needing to learn state values? Yes.
Policy learning methods directly learn optimal action strategies without explicitly learning state values. Before we learn how, we must first cover another important concept: Temporal Difference Learning vs Monte Carlo Sampling.
TD Learning vs MC Sampling
How does the agent consolidate future experiences to learn?
In Temporal Difference (TD) Learning, the agent updates its value estimates after every single step using the Bellman equation, bootstrapping from its own estimate of the Q-value in the next state. This strategy is called 1-step TD Learning: you take one step and update your estimates based on your own current predictions.
In TD Learning, we take one step and use the value estimates of the next state
The second option is called Monte Carlo sampling. Here, the agent waits for the entire episode to finish before updating anything, and then uses the complete return from the episode: Q(s, a) = r₁ + γr₂ + γ²r₃ + … + γⁿ⁻¹rₙ
In MC Sampling, we complete the entire episode to compute the estimates from actual rewards
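A sketch contrasting the two targets: the 1-step TD target bootstraps from the agent’s own estimate of the next state, while the Monte Carlo return is computed from the full list of rewards actually observed in the episode:

```python
def td_target(reward: float, next_state_estimate: float, gamma: float = 0.99) -> float:
    """1-step TD: bootstrap from our own current estimate of the next state's value."""
    return reward + gamma * next_state_estimate

def monte_carlo_return(rewards: list[float], gamma: float = 0.99) -> float:
    """MC: wait for the episode to end, then sum the actual discounted rewards."""
    g = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(monte_carlo_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
```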
Trade-offs between TD Learning and MC Sampling
TD Learning is pretty cool because the agent can learn something from every single step, even before it completes an episode. This means you can store collected experiences for a long time and keep training on old experiences, but with updated Q-value estimates. However, TD learning is heavily biased by the agent’s current estimates. If those estimates are wrong, it will keep reinforcing the wrong values. This is called the “bootstrapping problem.”
On the other hand, Monte Carlo learning is unbiased because it uses the true returns from actual episodes. But in most RL environments, rewards and state transitions can be random. Also, as the agent explores, its own actions can be random, so the states it visits during a rollout are random too. This results in pure Monte Carlo methods suffering from high variance, as returns can vary dramatically between episodes.
Policy Gradients
Alright, now that we have understood the concept of TD-Learning vs MC Sampling, it’s time to get back to Policy-Based Learning methods.
Recall that value-based methods like DQN first have to explicitly calculate the value, or Q-value, for every single possible action, and then they pick the best one. But it is possible to skip this step, and Policy Gradient methods like REINFORCE do exactly that.
Policy gradient methods input the state and output the probability of actions to take
In REINFORCE, the policy network outputs probabilities for each action, and we train it to increase the probability of actions that lead to good outcomes. For discrete spaces, PG methods output the probability of each action as a categorical distribution. For continuous spaces, they output a Gaussian distribution, predicting the mean and standard deviation of each element of the action vector.
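As a sketch of what those two output heads can look like in PyTorch (layer sizes are illustrative), the forward pass returns a distribution object the agent can sample actions from:

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Outputs a categorical distribution over n_actions."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Outputs a Gaussian over each element of a continuous action vector."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())
```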
So the question is, how exactly do you train such a model that directly predicts action probabilities from states?
Here is where the Policy Gradient Theorem comes in. In this article, I will explain the core idea intuitively.
- Our policy network is often denoted in the literature as π_θ(a|s). Here, θ denotes the weights of the neural network, and π_θ(a|s) is the predicted probability of action a in state s.
- From a newly initialized policy network, we let the agent play out a full episode and collect all the rewards.
- For every action it took, figure out the total discounted return that came after it. This is done using the Monte Carlo approach.
- Finally, to actually train the model, the policy gradient theorem asks us to maximize the formula provided in the figure below.
- If the return was high, this update makes that action more probable in the future by increasing π(a|s). If the return was negative, the update makes the action less probable by decreasing π(a|s). A code sketch of this update follows the figure below.
Policy Gradients
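Putting the steps above together, a minimal sketch of the REINFORCE update looks like this. Maximizing the sum of log π(a|s) times the return is implemented by minimizing its negative:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """log_probs: log pi_theta(a_t | s_t) for each step of the episode.
    returns: the discounted return G_t that followed each action."""
    return -(log_probs * returns).sum()

# Hypothetical usage inside a training loop:
# dist = policy(obs)                    # e.g. a Categorical distribution
# action = dist.sample()
# log_probs.append(dist.log_prob(action))
# ... at the end of the episode ...
# loss = reinforce_loss(torch.stack(log_probs), torch.tensor(returns))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```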
The distinction between Q-Learning and REINFORCE
One of the core differences between Q-Learning and REINFORCE is that Q-Learning uses 1-step TD Learning, and REINFORCE uses Monte Carlo Sampling.
Because Q-learning uses 1-step TD, it must estimate the quality value Q of every state-action pair: the agent takes just one step in the environment and bootstraps the rest from its own estimates.
On the other hand, with Monte Carlo sampling, the agent does not need to rely on an estimator to learn. Instead, it uses actual returns observed during exploration. This makes REINFORCE “unbiased,” with the caveat that it requires many samples to correctly estimate the value of a trajectory. Furthermore, the agent cannot train until it fully finishes a trajectory (that is, reaches a terminal state), and it cannot reuse trajectories after the policy network updates.
In practice, REINFORCE often leads to stability issues and sample inefficiency. Let’s talk about how Actor Critic addresses these limitations.
Advantage Actor Critic
If you try to use vanilla REINFORCE on most complex problems, it will struggle, and the reason why is twofold.
First, it suffers from high variance because it is a Monte Carlo sampling method. Second, it has no sense of a baseline. Imagine an environment that always gives you a positive reward: the returns will never be negative, so REINFORCE will increase the probabilities of all actions, albeit disproportionately.
We don’t want to reward actions just for getting a positive score. We want to reward them for being better than average.
And that is where the concept of advantage becomes important. Instead of just using the raw return to update our policy, we’ll subtract the expected return for that state. So our new update signal becomes:
Advantage = The Return you got – The Return you expected
The advantage gives us a baseline for our observed returns. Now let’s also discuss the concept of Actor Critic methods.
Actor Critic combines the best of Value-Based Methods (like DQN) and the best of Policy-Based Methods (like REINFORCE). Actor Critic methods train a separate “critic” neural network that is only trained to evaluate states, much like the Q-Network from earlier.
The actor network, on the other hand, learns the policy.
Advantage Actor Critic
Combining Advantage and Actor critics, we can understand how the popular A2C algorithm works:
- Initialize 2 neural networks: the policy or actor network, and the value or critic network. The actor network inputs a state and outputs action probabilities. The critic network inputs a state and outputs a single float representing the state’s value.
- We generate some rollouts in the environment by querying the actor.
- We update the critic network using either TD Learning or Monte Carlo returns. There are also more advanced approaches, like Generalized Advantage Estimation (GAE), that combine the two for more stable learning.
- We compute the advantage by subtracting the critic network’s expected return from the observed return.
- Finally, we update the policy network using the advantage and the policy gradient equation, as sketched below.
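Here is a sketch of one A2C update step under those assumptions: the actor produces an action distribution, the critic produces V(s), and the advantage weighs the policy gradient (coefficients are illustrative):

```python
import torch
import torch.nn.functional as F

def a2c_loss(dist, actions, values, returns, entropy_coef: float = 0.01) -> torch.Tensor:
    """dist: action distribution from the actor for each state in the rollout.
    values: critic estimates V(s); returns: observed discounted returns."""
    advantages = returns - values.detach()             # how much better than expected
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)           # train the critic towards the returns
    entropy_bonus = dist.entropy().mean()              # keep exploring
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy_bonus
```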
Actor-critic methods solve the variance problem in policy gradients by using a value function as a baseline. PPO (Proximal Policy Optimization) extends A2C by adding the concepts of “trust regions” into the learning algorithm, which prevents excessive changes to the network weights during learning. We won’t get into details about PPO in this article; maybe someday we will open that Pandora’s box.
Conclusion
This article is a companion piece to the YouTube video I made, linked below. Feel free to check it out if you enjoyed this read.
Every RL algorithm makes specific choices for each of these design questions, and these choices cascade through the entire system, affecting everything from sample efficiency to stability to real-world performance.
In the end, creating an RL algorithm is about answering these questions with your own choices. DQNs choose to learn values. Policy gradient methods directly learn a policy. Monte Carlo methods update after a full episode using actual returns – this makes them unbiased, but they suffer from high variance because of the stochastic nature of RL exploration. TD Learning instead chooses to learn at every step based on the agent’s own estimates. Actor Critic methods combine the two by learning an actor network and a critic network separately.
Note that there’s a lot we didn’t cover today. But this is a good base to get you started with Reinforcement Learning.
That’s the end of this article, see you in the next one! You can use the links below to discover more of my work.
**My Patreon:** https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel: https://www.youtube.com/@avb_fj
**Follow me on Twitter:** https://x.com/neural_avb
Read my articles: https://towardsdatascience.com/author/neural-avb/