Have you ever wondered how you’d teach a robot to land a drone without programming every single move? That’s exactly what I set out to explore. I spent weeks building a game where a virtual drone has to figure out how to land on a platform—not by following pre-programmed instructions, but by learning from trial and error, just like how you learned to ride a bike.
This is Reinforcement Learning (RL), and it’s fundamentally different from other machine learning approaches. Instead of showing the AI thousands of examples of “correct” landings, you give it feedback: “Hey, that was pretty good, but maybe try being more gentle next time?” or “Yikes, you crashed—probably don’t do that again.” Through countless attempts, the AI figures out what works and what doesn’t.
In this post, I’m documenting my journey from RL basics to building a working system that (mostly!) teaches a drone to land. You’ll see the successes, the failures, and all the weird behaviors I had to debug along the way.
1. Reinforcement learning: Overview
A lot of the idea can be related to Pavlov’s dog and Skinner’s rat experiments. The idea is that you give the subject a ‘reward’ when it does something you want it to do (positive reinforcement) and a ‘penalty’ when it does something bad (a punishment, or in RL terms, a negative reward). Through many repeated attempts, your subject learns from this feedback, gradually discovering which actions lead to success—similar to how Skinner’s rat learned which lever presses produced food rewards.
Fig 1. Pavlov’s classical conditioning experiment (AI-generated image by Google’s Gemini)
In the same fashion, we want a system that learns to do things (or tasks) in a way that maximizes reward and minimizes penalty. Keep this point about maximizing reward in mind; it will come up again later.
1.1 Core Concepts
When talking about systems that can be implemented programmatically on computers, the best practice is to write clear definitions for ideas that can be abstracted. In the study of AI (and more specifically, Reinforcement learning), the core ideas can be boiled down to the following:
- Agent (or Actor): This is our subject from the previous section. This can be the dog, a robot trying to navigate a huge factory, a video game NPC, etc.
- Environment (or the world): This can be a place, a simulation with restrictions, a video game’s virtual game world, etc. I think of it as “a box, real or virtual, to which the agent’s entire life is confined; it only knows what happens within the box. We, as the overlords, can alter this box, while the agent will think that god is exacting his will on its world.”
- Policy: Just like in governments, companies, and many more similar entities, ‘policies’ dictate “What actions should be taken when given a certain situation”.
- State: This is what the agent “sees” or “knows” about its current situation. Think of it as the agent’s snapshot of reality at any given moment—like how you see the traffic light color, your speed, and the distance to the intersection when driving.
- Action: Now that our agent can ‘see’ things in its environment, it may want to do something about its state. Maybe it just woke up from a long night’s slumber, and now it wants to get a cup of coffee. In this case, the first thing it will do is get out of bed. This is an action that the agent will take to achieve its goal, i.e., GET SOME COFFEE!
- Reward: Every time the actor executes an action (of its own volition), something may change in the world. For example, our agent got out of bed and started walking towards the kitchen, but then, because it is so bad at walking, it tripped and fell. In this situation, the god (us) rewards it with a punishment for being bad at walking (negative reward). But then the agent makes it to the kitchen and gets the coffee, so the god (us) rewards it with a cookie (positive reward).
Fig. 2 Illustration of a theoretical RL system
As you can imagine, most of these key components need to be tailored for the specific task/problem that we want the agent to solve.
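To make these definitions concrete before we get to the drone, here is a minimal sketch of the agent-environment loop in Python. ToyEnv and RandomAgent are throwaway stand-ins I made up for illustration; they are not part of the drone project.
import random

class ToyEnv:
    """A toy 'box' of a world: the state is an integer and the goal is to reach 0."""
    def reset(self):
        self.state = 5
        return self.state

    def step(self, action):                     # action is -1 or +1
        self.state += action
        done = (self.state == 0)
        reward = 1.0 if done else -0.1          # small penalty per step, bonus on success
        return self.state, reward, done

class RandomAgent:
    """A placeholder policy: acts at random, no learning yet."""
    def act(self, state):
        return random.choice([-1, +1])

env, agent = ToyEnv(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
for _ in range(1000):                           # cap the episode length
    if done:
        break
    action = agent.act(state)                   # policy: state -> action
    state, reward, done = env.step(action)      # the environment reacts
    total_reward += reward                      # the quantity the agent should learn to maximize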
2. The Gym
Now that we understand the basics, you might be wondering: how do we actually build one of these systems? Let me show you the game I built.
For this post, I have written a bespoke video game that anyone can access and use to train their own machine learning agent to play the game.
The full code repository can be found on GitHub (please star this). I intend to use this repository for more games and simulation code, along with more advanced techniques that I will implement in my next installments of posts on RL.
Delivery Drone
The delivery drone is a game where the objective is to fly a drone (likely containing deliveries) onto a platform. To win the game, we have to land. To land, we have to meet the following criteria:
- Be in landing proximity to the platform
- Be slow enough
- Be upright (Landing upside down is more like crashing than landing)
All information on how to run the game can be found in the GitHub repository.
Here’s what the game looks like
Fig. 3 A screenshot of the game that I made for this project
If the drone flies off the screen or touches the ground, it will be considered a ‘crash’ case and thus lead to a failure.
State description
The drone observes 15 continuous values that completely describe its situation:
- Drone position: drone_x, drone_y
- Drone velocity: drone_vx, drone_vy
- Orientation: drone_angle, drone_angular_vel
- Fuel remaining: drone_fuel
- Platform position: platform_x, platform_y
- Derived metrics: distance_to_platform, dx_to_platform, dy_to_platform, speed
- Status flags: landed, crashed
Landing Success Criteria: The drone must simultaneously achieve:
- Horizontal alignment: within platform bounds (|dx| < 0.0625)
- Safe approach speed: less than 0.3
- Level orientation: tilt less than 20° (|angle| < 0.111)
- Correct altitude: bottom of drone touching platform top
It’s like parallel parking—you need the right position, right angle, and moving slowly enough to not crash!
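As a rough sketch, the success check looks something like this (the helper name is mine, and the real game additionally checks that the drone’s bottom is actually touching the platform top):
def is_landing_ready(state) -> bool:
    """Illustrative check of the landing criteria above (not the game's actual code)."""
    aligned = abs(state.dx_to_platform) < 0.0625   # horizontally within platform bounds
    slow = state.speed < 0.3                       # gentle approach speed
    upright = abs(state.drone_angle) < 0.111       # roughly level, i.e. tilt under ~20 degrees
    return aligned and slow and upright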
How can someone design a policy?
There are many ways to design a policy. It can be Bayesian (maintaining probability distributions over beliefs), it can be a simple lookup table for discrete states, a hand-coded rule system (“if distance < 10, then brake”), a decision tree, or—as we’ll explore—a neural network that learns the mapping from states to actions through gradient descent.
Effectively, we want something that takes in the aforementioned state, performs some computation using this state, and returns what action should be performed.
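For example, a hand-coded policy for our drone could be nothing more than a handful of if-statements. The sketch below is purely illustrative (the sign conventions are guesses on my part, and hand_coded_policy is not part of the repository):
def hand_coded_policy(state):
    """A naive rule-based policy: state in, action out, no learning involved."""
    action = {"main_thrust": 0, "left_thrust": 0, "right_thrust": 0}
    if state.drone_vy < -0.1:            # descending quickly (assuming negative vy means downward)
        action["main_thrust"] = 1        # fight gravity
    if state.drone_angle > 0.05:         # tilted one way: fire the opposite side thruster
        action["left_thrust"] = 1
    elif state.drone_angle < -0.05:      # tilted the other way
        action["right_thrust"] = 1
    return action
Rules like this are easy to write but painful to tune, and they never improve on their own; that gap is exactly what a learned policy is meant to fill.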
Deep Learning to build a policy?
So how do we design a policy that can handle continuous states (like exact drone positions) and learn complex behaviors? This is where neural networks come in.
In the case of neural networks (i.e., in deep learning), it is generally best to work with action probabilities, i.e., “What action is likely the best given the current state?”. So, we can define a neural network that takes the state as a ‘vector’ (or a collection of vectors) as input, where this vector is constructed from the observed state. For our delivery drone game, the state vector is:
State vector (from our 2D drone game)
The drone observes its absolute position, velocities, orientation, fuel, platform position, and derived metrics. Our continuous state is:

s = [x, y, v_x, v_y, θ, ω, fuel, p_x, p_y, d, dx, dy, speed, landed, crashed]

Where each component represents:
- x, y: drone position; v_x, v_y: drone velocity
- θ, ω: drone angle and angular velocity
- fuel: remaining fuel
- p_x, p_y: platform position
- d, dx, dy: distance and horizontal/vertical offsets to the platform
- speed: magnitude of the velocity
- landed, crashed: terminal status flags
All components are normalized to roughly [0,1] or [-1,1] ranges for stable neural network training.
Action space (three independent binary thrusters)
Instead of discrete action combinations, we treat each thruster independently:
- Main thruster (upward thrust)
- Left thruster (clockwise rotation)
- Right thruster (counter-clockwise rotation)
Each action is sampled from a Bernoulli distribution, giving us 3 independent binary decisions per timestep.
Neural-network policy (probabilistic with Bernoulli sampling)
Let f_θ(s) be the three network outputs after sigmoid activation, so each f_θ(s)_i ∈ (0, 1). The policy uses independent Bernoulli distributions:

π_θ(a | s) = Π_{i=1}^{3} f_θ(s)_i^{a_i} * (1 - f_θ(s)_i)^{(1 - a_i)}, with a_i ∈ {0, 1}
Minimal Python sketch (from our implementation)
# build state vector from DroneState
s = np.array([
state.drone_x, state.drone_y,
state.drone_vx, state.drone_vy,
state.drone_angle, state.drone_angular_vel,
state.drone_fuel,
state.platform_x, state.platform_y,
state.distance_to_platform,
state.dx_to_platform, state.dy_to_platform,
state.speed,
float(state.landed), float(state.crashed)
])
# network outputs probabilities for each thruster (after sigmoid)
action_probs = policy(torch.tensor(s, dtype=torch.float32)) # shape: (3,)
# sample each thruster independently from Bernoulli
dist = Bernoulli(probs=action_probs)
action = dist.sample() # shape: (3,), e.g., [1, 0, 1] means main+right thrusters
This shows how we map the game’s physical observations into a 15-dimensional normalized state vector and produce independent binary decisions for each thruster.
Code setup (part 1): Imports and game socket setup
We first want our game’s socket listener to start. For this, you can navigate to the delivery_drone directory in my repository and run the following command:
pip install -r requirements.txt # run this once for setting up the required modules
python socket_server.py --render human --port 5555 --num-games 1 # run this whenever you need to run the game in socket mode
NOTE: You will need PyTorch to run the code. Please make sure that you have set it up beforehand.
import os
import torch
import torch.nn as nn
import math
import numpy as np
from torch.distributions import Bernoulli
# Import the game's socket client
from delivery_drone.game.socket_client import DroneGameClient, DroneState
# setup the client and connect to the server
client = DroneGameClient()
client.connect()
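Before training anything, it helps to poke the environment by hand. A quick sanity check, using the same reset/step calls that the training code below relies on (the fields I print are just the ones I like to eyeball):
# Quick sanity check: reset game 0 and fire the main thruster for a few steps
state = client.reset(0)
print("start:", state.drone_x, state.drone_y, state.drone_fuel)

for _ in range(5):
    state, _, done, _ = client.step(
        {"main_thrust": 1, "left_thrust": 0, "right_thrust": 0}, 0
    )
    print("y:", state.drone_y, "vy:", state.drone_vy, "done:", done)
    if done:
        break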
How to design a reward function?
So what makes a good reward function? This is arguably the hardest part of RL (and where I spent a LOT of my debugging time 🫠).
The reward function is the soul of any RL implementation (and trust me, get this wrong and your agent will do the weirdest things). In theory, it should define what ‘good’ behaviour should be learnt and what ‘bad’ behaviour should be avoided. Each action our agent takes is then scored by the reward it accumulates across every behaviour trait that the action exhibits. For example, if you want the drone to land gently, you might give positive rewards for being close to the platform and moving slowly, while penalizing crashes or running out of fuel—the agent then learns to maximize the sum of all these rewards over time.
Advantage: A better way to measure effective reward
When training our policy, we don’t just want to know if an action rewarded us—we want to know if it was better than usual. This is the intuition behind the advantage.
The advantage tells us: “Was this action better or worse than what we typically expect?”

A_t = G_t - b(s_t)

where G_t is the return from timestep t onward and b(s_t) is a baseline (here, simply the mean return).
In our implementation, we:
- Collect multiple episodes and calculate their returns (total discounted rewards)
- Compute the baseline as the mean return across all episodes
- Calculate advantage = return – baseline for each timestep
- Normalize advantages to have mean=0 and std=1 (for stable training)
Why this helps:
- Actions with positive advantage → better than average → increase their probability
- Actions with negative advantage → worse than average → decrease their probability
- Reduces variance in gradient updates (more stable learning)
This simple baseline already gives us much better training than raw returns! It weighs each full sequence of actions against how the episode turned out (crashed or landed), so the policy learns to favour actions that lead to better-than-average outcomes.
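In code, that recipe is only a few lines. Here is a toy sketch with made-up episode returns (the per-timestep version used in training appears later in the post):
import torch

# Toy example: total discounted returns from 4 collected episodes
returns = torch.tensor([120.0, -80.0, 40.0, -200.0])

baseline = returns.mean()                 # mean return across episodes
advantages = returns - baseline           # better (+) or worse (-) than usual?
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize

print(baseline)      # tensor(-30.)
print(advantages)    # positive for the two good episodes, negative for the two bad ones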
After a lot of trial and error, I have designed the following reward function. The key insight was to condition rewards on both proximity AND vertical position – the drone must be above the platform to receive positive rewards, preventing exploitation strategies like hovering below the platform.

Short note on inversely (and non-linearly) scaling reward
Often, we want to reward behaviors inversely proportional to certain state values. For example, distance to the platform ranges from 0 to ~1.41 (normalized by window width). We want a high reward when the distance ≈ 0 and a low reward when far away. I use various scaling functions for this:
Fig. 4 Gaussian scaling function
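A Gaussian-shaped scaling like the one plotted above can be written as follows (my own sketch; the exact constants used for the figure may differ):
import numpy as np

def gaussian_scaling(x, decay=20, scaler=10):
    """Illustrative Gaussian-shaped scaling: maximum reward at x = 0, smooth decay with distance."""
    return scaler * np.exp(-decay * x**2)

# e.g. gaussian_scaling(0.0) == 10, while gaussian_scaling(0.5) ≈ 0.07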
Examples for other useful scaling functions
Helper functions:
def inverse_quadratic(x, decay=20, scaler=10, shifter=0):
"""Reward decreases quadratically with distance"""
return scaler / (1 + decay * (x - shifter)**2)
def scaled_shifted_negative_sigmoid(x, scaler=10, shift=0, steepness=10):
"""Sigmoid function scaled and shifted"""
return scaler / (1 + np.exp(steepness * (x - shift)))
def calc_velocity_alignment(state: DroneState):
"""
Calculate how well the drone's velocity is aligned with optimal direction to platform.
Returns cosine similarity: 1.0 = perfect alignment, -1.0 = opposite direction
"""
# Optimal direction: from drone to platform
optimal_dx = -state.dx_to_platform
optimal_dy = -state.dy_to_platform
optimal_norm = math.sqrt(optimal_dx**2 + optimal_dy**2)
if optimal_norm < 1e-6: # Already at platform
return 1.0
optimal_dx /= optimal_norm
optimal_dy /= optimal_norm
# Current velocity direction
velocity_norm = state.speed
if velocity_norm < 1e-6: # Not moving
return 0.0
velocity_dx = state.drone_vx / velocity_norm
velocity_dy = state.drone_vy / velocity_norm
# Cosine similarity
return velocity_dx * optimal_dx + velocity_dy * optimal_dy
Code for the current reward function:
def calc_reward(state: DroneState):
rewards = {}
total_reward = 0
# 1. Time penalty - distance-based (penalize more when far)
minimum_time_penalty = 0.3
maximum_time_penalty = 1.0
rewards['time_penalty'] = -inverse_quadratic(
state.distance_to_platform,
decay=50,
scaler=maximum_time_penalty - minimum_time_penalty
) - minimum_time_penalty
total_reward += rewards['time_penalty']
# 2. Distance & velocity alignment - ONLY when above platform
velocity_alignment = calc_velocity_alignment(state)
dist = state.distance_to_platform
rewards['distance'] = 0
rewards['velocity_alignment'] = 0
# Key condition: drone must be above platform (dy > 0) to get positive rewards
if dist > 0.065 and state.dy_to_platform > 0:
# Reward movement toward platform when velocity is aligned
if velocity_alignment > 0:
rewards['distance'] = state.speed * scaled_shifted_negative_sigmoid(dist, scaler=4.5)
rewards['velocity_alignment'] = 0.5
total_reward += rewards['distance']
total_reward += rewards['velocity_alignment']
# 3. Angle penalty - distance-based threshold
abs_angle = abs(state.drone_angle)
max_angle = 0.20
max_permissible_angle = ((max_angle - 0.111) * dist) + 0.111
excess = abs_angle - max_permissible_angle
rewards['angle'] = -max(excess, 0)
total_reward += rewards['angle']
# 4. Speed penalty - penalize excessive speed
rewards['speed'] = 0
speed = state.speed
max_speed = 0.4
if dist < 1:
rewards['speed'] = -2 * max(speed - 0.1, 0)
else:
rewards['speed'] = -1 * max(speed - max_speed, 0)
total_reward += rewards['speed']
# 5. Vertical position penalty - penalize being below platform
rewards['vertical_position'] = 0
if state.dy_to_platform > 0: # Drone is above platform (GOOD)
rewards['vertical_position'] = 0
else: # Drone is below platform (BAD!)
rewards['vertical_position'] = state.dy_to_platform * 4.0 # Negative penalty
total_reward += rewards['vertical_position']
# 6. Terminal rewards
rewards['terminal'] = 0
if state.landed:
rewards['terminal'] = 500.0 + state.drone_fuel * 100.0
elif state.crashed:
rewards['terminal'] = -200.0
# Extra penalty for crashing far from target
if state.distance_to_platform > 0.3:
rewards['terminal'] -= 100.0
total_reward += rewards['terminal']
rewards['total'] = total_reward
return rewards
And yes, those magic numbers like 4.5, 0.065, and 4.0? They came from a lot of trial and error. Welcome to RL, where hyperparameter tuning is half art, half science, and half luck (yes, I know that’s three halves).
To turn these per-step rewards into a learning signal, we also need the discounted return for each timestep:
def compute_returns(rewards, gamma=0.99):
"""
Compute discounted returns (G_t) for each timestep based on the Bellman equation
G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
"""
returns = []
G = 0
# Compute backwards (more efficient)
for r in reversed(rewards):
G = r + gamma * G
returns.insert(0, G)
return returns
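As a quick sanity check of the discounting (hand-computed, not from a training run): with rewards [1, 0, 2] and γ = 0.9, the returns should be [1 + 0.9·0 + 0.81·2, 0 + 0.9·2, 2] = [2.62, 1.8, 2.0].
print(compute_returns([1.0, 0.0, 2.0], gamma=0.9))
# ≈ [2.62, 1.8, 2.0]; the earlier the timestep, the more of the future it "sees"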
The important thing to note is that reward functions are subject to careful trial and error. One mistake or over-generous reward here, and the agent will happily optimize behaviour that exploits it. This leads us to reward hacking.
Reward hacking
Reward hacking occurs when an agent finds an unintended way to maximize reward without actually solving the task you wanted it to solve. The agent isn’t “cheating” on purpose—it’s doing exactly what you told it to do, just not what you meant for it to do.
Classic example: If you reward a cleaning robot for “no visible dirt,” it might learn to turn off its camera instead of cleaning!
My painful learning experience: I found this out the hard way. In an early version of my drone landing reward function, I gave the drone points for being “stable and slow” anywhere near the platform. Sounds reasonable, right? Wrong! Within 50 training episodes, my drone learned to just hover in place forever, racking up free points. It was technically optimal for my badly-designed reward function—but actually landing? Nope! I watched it hover for 5 minutes straight before I realized what was happening.
Here’s the problematic code I wrote:
# DO NOT COPY THIS!
# If drone is above the platform (|dx| < 0.0625) and close (distance < 0.25):
corridor_reward = inverse_quadratic(distance, decay=20, scaler=15) # Up to 15 points
if stable and slow:
corridor_reward += 10 # Extra 10 points!
# Total possible: 25 points per step!
An example of reward hacking in action:
Fig. 5 The drone learnt to hover around the platform and farm rewards
Fig. 6 Plot that shows that the drone is clearly reward hacking
Making a policy network
As discussed above, we are going to use a neural network as the policy that powers the brain of our agent. Here’s a simple implementation that takes in the state vector and computes a probability distribution over 3 independent actions:
- Activate the main thruster
- Activate the left thruster
- Activate the right thruster
def state_to_array(state):
"""Helper function to convert DroneState dataclass to numpy array"""
data = np.array([
state.drone_x,
state.drone_y,
state.drone_vx,
state.drone_vy,
state.drone_angle,
state.drone_angular_vel,
state.drone_fuel,
state.platform_x,
state.platform_y,
state.distance_to_platform,
state.dx_to_platform,
state.dy_to_platform,
state.speed,
float(state.landed),
float(state.crashed)
])
return torch.tensor(data, dtype=torch.float32)
class DroneGamerBoi(nn.Module):
def __init__(self, state_dim=15):
super().__init__()
self.network = nn.Sequential(
nn.Linear(state_dim, 128),
nn.LayerNorm(128),
nn.ReLU(),
nn.Linear(128, 128),
nn.LayerNorm(128),
nn.ReLU(),
nn.Linear(128, 64),
nn.LayerNorm(64),
nn.ReLU(),
nn.Linear(64, 3),
nn.Sigmoid()
)
def forward(self, state):
if isinstance(state, DroneState):
state = state_to_array(state)
return self.network(state)
Effectively, instead of the action space being a 2³ = 8 space, I reduced it to decisions over the three independent thrusters using Bernoulli sampling. This reduction makes optimization easier by treating each thruster independently rather than as one big categorical choice (at least that is what I think—I may be wrong, but it worked for me!).
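To see the difference concretely, here is a small comparison (a sketch, not code from the repository): the categorical version would need one output per thruster combination, while the Bernoulli version keeps three independent probabilities.
import torch
from torch.distributions import Bernoulli, Categorical

# Option A: one categorical choice over all 2³ = 8 thruster combinations
logits = torch.zeros(8)                      # e.g. index 5 = binary 101 = main + right thrusters
combo = Categorical(logits=logits).sample()  # a single integer in [0, 7]

# Option B (used here): three independent on/off decisions
probs = torch.tensor([0.9, 0.1, 0.6])        # per-thruster "on" probabilities
thrusters = Bernoulli(probs=probs).sample()  # e.g. tensor([1., 0., 1.])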
Training a policy with policy gradients
Learning Strategies: When Should We Update?
Here’s a question that tripped me up early on: should we update the policy after every single action, or wait and see how the whole episode plays out? Turns out, this choice matters a lot.
When you try to optimize based purely on the reward received for a single action, you run into a high-variance problem (basically, the training signal is super noisy and the gradients point in random directions!). What I mean by “high variance” is that the optimization algorithm receives extremely mixed signals in the gradients used to update the parameters of our policy network. For the same action, the system may emit one gradient direction, but a slightly different state (with the same action) might yield something completely opposite. This leads to slow, and potentially no, learning.
There are three ways we could update our policy:
Learning after every action (Per-Step Updates)
The drone fires its thruster once, gets a small reward, and immediately updates its entire strategy. This is like adjusting your basketball form after every single shot—way too reactive! One lucky action that increases the reward doesn’t necessarily mean that the agent did good, and one unlucky action doesn’t mean the agent did bad. The learning signal is just too noisy.
My first attempt: I tried this approach early on. The drone would wiggle around randomly, make one lucky move that got a tiny bit more reward, immediately overfit to that exact move, and then crash repeatedly trying to reproduce it. It was painful to watch—like watching someone learn the wrong lesson from pure chance.
Learning after one complete attempt (Per-Episode Updates)
Better! Now we let the drone try to land (or crash), see how the whole attempt went, and then update. This is like finishing an episode and then thinking about what to improve. At least now we see the full consequences of our actions. But here’s the problem: what if that one landing was just lucky? Or unlucky? We’re still basing our learning on a single data point.
Learning from multiple attempts (Multi-Episode Batch Updates)
This is the sweet spot. We run multiple (6 in my case) drone landing attempts simultaneously, see how they all went, and then update our policy based on the average performance. Some attempts might get lucky, some unlucky, but averaged together, we get a much clearer picture of what actually works. Although this is quite heavy on the computer, if you can run it, it works way better than any of the previous methods. Of course, this method is certainly not the best, but it is quite simple to understand and implement; there are other (and better) methods.
Here’s the code to collect multiple episodes in the drone game:
def collect_episodes(client: DroneGameClient, policy: nn.Module, max_steps=300):
"""
Collect episodes with early stopping
Args:
client: The game's socket client
policy: PyTorch module
max_steps: Maximum steps per episode (default: 300)
"""
num_games = client.num_games
# Initialize storage
all_episodes = [{'states': [], 'actions': [], 'log_probs': [], 'rewards': [], 'done': False}
for _ in range(num_games)]
# Reset all games
game_states = [client.reset(game_id) for game_id in range(num_games)]
step_counts = [0] * num_games # Track steps per game
while not all(ep['done'] for ep in all_episodes):
# Batch active games
batch_states = []
active_game_ids = []
for game_id in range(num_games):
if not all_episodes[game_id]['done']:
batch_states.append(state_to_array(game_states[game_id]))
active_game_ids.append(game_id)
if len(batch_states) == 0:
break
# Batched inference
batch_states_tensor = torch.stack(batch_states)
batch_action_probs = policy(batch_states_tensor)
batch_dist = Bernoulli(probs=batch_action_probs)
batch_actions = batch_dist.sample()
batch_log_probs = batch_dist.log_prob(batch_actions).sum(dim=1)
# Execute actions
for i, game_id in enumerate(active_game_ids):
action = batch_actions[i]
log_prob = batch_log_probs[i]
next_state, _, done, _ = client.step({
"main_thrust": int(action[0]),
"left_thrust": int(action[1]),
"right_thrust": int(action[2])
}, game_id)
reward = calc_reward(next_state)
# Store data
all_episodes[game_id]['states'].append(batch_states[i])
all_episodes[game_id]['actions'].append(action)
all_episodes[game_id]['log_probs'].append(log_prob)
all_episodes[game_id]['rewards'].append(reward['total'])
# Update state and step count
game_states[game_id] = next_state
step_counts[game_id] += 1
# Check done conditions
if done or step_counts[game_id] >= max_steps:
# Apply timeout penalty if hit max steps without landing
if step_counts[game_id] >= max_steps and not next_state.landed:
all_episodes[game_id]['rewards'][-1] -= 500 # Timeout penalty
all_episodes[game_id]['done'] = True
# Return episodes
return [(ep['states'], ep['actions'], ep['log_probs'], ep['rewards'])
for ep in all_episodes]
The Maximization-Minimization Puzzle
In typical deep learning (supervised learning), we minimize a loss function:

L(θ) = (1/N) * Σ_i loss(f_θ(x_i), y_i)
We want to go “downhill” toward lower loss (better predictions).
But in reinforcement learning, we want to maximize total reward! Our goal is:

J(θ) = E[ Σ_t γ^t * r_t ]

i.e., the expected total discounted reward collected by our policy π_θ.
The problem: Deep learning frameworks are built for minimization, not maximization. How do we turn “maximize reward” into “minimize loss”?
The simple trick: Maximize J(θ) = Minimize -J(θ)
So our loss function becomes:

L(θ) = -J(θ)
Now, gradient descent will climb up the reward landscape (so it is effectively gradient ascent), because going down the negative reward is the same as going up the reward!
The REINFORCE Algorithm (Policy Gradient)
The policy gradient theorem (Williams, 1992) tells us how to compute the gradient of the expected reward:

∇_θ J(θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) * G_t ]
(I know, I know—this looks intimidating. But stick with me, it’s actually quite elegant once you see what’s happening!)
Where:
- π_θ(a_t | s_t) is the probability our policy assigns to action a_t in state s_t
- G_t is the (discounted) return from timestep t onward
- ∇_θ is the gradient with respect to the policy network’s parameters θ
In plain English (because that formula is dense):
- If action a_t led to a high return G_t, increase its probability
- If action a_t led to a low return G_t, decrease its probability
- The gradient tells us which direction to adjust the neural network weights
Adding a Baseline (Variance Reduction)
Using raw returns G_t leads to high variance (noisy gradients). We improve this by subtracting a baseline b(s_t):

∇_θ J(θ) = E[ Σ_t ∇_θ log π_θ(a_t | s_t) * (G_t - b(s_t)) ]
The simplest baseline is the mean return:

b = (1/N) * Σ G_t (the average return over everything collected in the batch)
This gives us the advantage: A_t = G_t - b
- Positive advantage → action was better than average → increase probability
- Negative advantage → action was worse than average → decrease probability
Why this helps: Instead of “this action gave reward 100” (is that good?), we have “this action gave 100 when the average is 50” (that’s great!). Relative performance is clearer than absolute.
Our Implementation
In our drone landing code, we use REINFORCE with baseline:
# 1. Collect episodes and compute returns
returns = compute_returns(rewards, gamma=0.99) # G_t with discounting
returns_tensor = torch.tensor(returns, dtype=torch.float32)
# log_probs_tensor: the per-step log-probabilities saved during collection, stacked into one tensor
# 2. Compute baseline (mean of all returns)
baseline = returns_tensor.mean()
# 3. Compute advantages
advantages = returns_tensor - baseline
# 4. Normalize advantages (extra variance reduction)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# 5. Compute loss (note the negative sign!)
loss = -(log_probs_tensor * advantages).mean()
# 6. Gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
We repeat the above loop as many times as we want or till the drone learns to land properly. Have a look at this notebook for more code!
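Putting the pieces from this post together, the outer training loop looks roughly like this. This is a condensed sketch of what the notebook does; the iteration count and learning rate here are placeholders, not the exact values I used.
policy = DroneGamerBoi(state_dim=15)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for iteration in range(500):
    episodes = collect_episodes(client, policy, max_steps=300)

    # Flatten all episodes into per-timestep returns and log-probs
    all_log_probs, all_returns = [], []
    for states, actions, log_probs, rewards in episodes:
        all_returns.extend(compute_returns(rewards, gamma=0.99))
        all_log_probs.extend(log_probs)

    log_probs_tensor = torch.stack(all_log_probs)
    returns_tensor = torch.tensor(all_returns, dtype=torch.float32)

    # REINFORCE with a mean baseline and normalized advantages
    advantages = returns_tensor - returns_tensor.mean()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    loss = -(log_probs_tensor * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()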
Current Results (reward function is still quite flawed)
After countless hours of tweaking rewards, adjusting hyperparameters, and watching my drone crash in creative new ways, I finally got it working (mostly!). Even though my designed reward function is not perfect, I do think that it is able to teach a policy network. Here’s a successful landing:
Fig. 7 The drone learnt something!
Pretty cool, right? But here’s where things get interesting (and frustrating)…
The persistent hovering problem: A fundamental limitation
Even with the improved reward function that conditions rewards on vertical position (dy_to_platform > 0), the trained policy still exhibits a frustrating behavior: when the drone misses the platform, it learns to descend toward it but then hovers below the platform rather than attempting to land.
I spent over a week staring at reward plots (and altering reward functions), wondering why my “fixed” reward function was still producing this hovering behavior. When I finally plotted the accumulated rewards, the pattern became crystal clear—and honestly, I couldn’t even be mad at the agent for finding this strategy.
What’s happening?
By analyzing the accumulated rewards over an episode where the drone hovers below the platform, I discovered something interesting:
Fig. 8 Gif showing the “hovering below platform” problem
Fig. 9 Accumulated rewards for an episode where the drone hovers below the platform
The plots reveal that:
- Distance reward (orange): Accumulates to ~+70 early, then plateaus (no more rewards)
- Velocity alignment (green): Accumulates to ~+30 early, then plateaus
- Time penalty (blue): Steadily accumulates to ~-250 (keeps getting worse)
- Vertical position (brown): Steadily accumulates to ~-200 (penalty for being below)
- Total reward: Ends around -400 to -600 (after timeout)
The key insight: The drone descends from above the platform (collecting distance and velocity rewards on the way down), passes through the platform height, and then settles into hovering below instead of completing the landing. Once below, it stops getting positive rewards (notice how the distance and velocity lines plateau around step 50-60) but continues accumulating time penalties and vertical position penalties. However, this strategy is still viable because attempting to land risks an immediate -200 crash penalty, whereas hovering below “only” costs ~-400 to -600 over the full episode.
Why does this happen?
The fundamental issue is that our reward function r(s, a) can only see the current state, not the trajectory. Think about it: at any single timestep, the reward function can’t tell the difference between:
- A drone making progress toward landing (approaching from above with controlled descent)
- A drone exploiting the reward structure (oscillating to farm rewards)
Both might have dy_to_platform > 0 at a given moment, so they receive identical rewards! The agent isn’t dumb—it’s just optimizing exactly what you told it to optimize.
So what would actually fix this?
To truly solve this problem, I personally think that rewards should depend on state transitions: r(s, a, s') instead of just r(s, a). With s being the current state and s' the next state, this would let you reward based on:
- Progress: Only reward if distance(s') < distance(s) (actually getting closer!)
- Vertical improvement: Only reward if the drone is consistently moving upward relative to the platform
- Trajectory consistency: Penalize rapid direction changes that indicate oscillation
This is a more principled solution than trying to patch the current reward function with increasingly harsh penalties (which is basically what I tried for a while, and it didn’t really work). The oscillation exploit exists because we’re fundamentally missing information about the trajectory.
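To make the idea concrete, a transition-based progress term could look something like this. This is a sketch of the proposed fix, not something I have trained with yet; prev_state would have to be threaded through the episode-collection loop.
def calc_progress_reward(prev_state: DroneState, state: DroneState, scaler=5.0):
    """Reward actual progress: positive only if this step moved the drone closer to the platform."""
    progress = prev_state.distance_to_platform - state.distance_to_platform
    reward = scaler * max(progress, 0.0)        # getting closer pays off...
    if state.dy_to_platform <= 0:               # ...but never while hovering below the platform
        reward = 0.0
    return reward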
In the next post, I’ll explore Actor-Critic methods and techniques that can incorporate temporal information to prevent these exploitation strategies. Stay tuned!
If you find a way to fix this, please reach out to me!
This brings us to the end of this post on “the simplest way to do Deep Reinforcement Learning.”
Next on the list
- Actor-Critic systems
- DQL
- PPO & GRPO
- Applying this to systems that require vision 👀
References
Foundational Stuff
- Turing, A. M. (1950). “Computing Machinery and Intelligence.” Mind.
- Original Turing Test paper
- Williams, R. J. (1992). “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.” Machine Learning.
- The REINFORCE algorithm
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- The foundational textbook
- Free online: http://incompleteideas.net/book/the-book-2nd.html
Classical Conditioning & Behavioral Psychology
- Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Oxford University Press.
- Classical conditioning experiments
- Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. Appleton-Century-Crofts.
- Operant conditioning and the Skinner Box
Policy Gradient Methods
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). “Policy Gradient Methods for Reinforcement Learning with Function Approximation.” Advances in Neural Information Processing Systems.
- Theoretical foundations of policy gradients
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” arXiv preprint arXiv:1506.02438.
Neural Networks & Deep Learning
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Reference for neural network fundamentals
- Available online: https://www.deeplearningbook.org/
Online Resources
- Karpathy, A. “Deep Reinforcement Learning: Pong from Pixels.”
- Blog post: http://karpathy.github.io/2016/05/31/rl/
- Influential educational resource
- Spinning Up in Deep RL by OpenAI
- Educational resource: https://spinningup.openai.com/
- Excellent policy gradient explanations
Code Repository
- Jumle, V. (2025). “Reinforcement Learning 101: Delivery Drone Landing.”
Friend
- Singh, Navroop Kaur. (2025): For providing “Positive Vibes & Attention”. Thank you!
All images in this article are either AI-generated (using Gemini), personally made by me, or screenshots & plots that I made.