18 Dec, 2025
Hey! Tucker's back!
This is the second in a series of blog posts centered around my current personal project, can-t.
Part 1: Can-t stop till you get enough (this post makes the slight assumption that you have read that one as well)
email: tucker.bull.morgan@gmail.com, open to work, resume
So what is Backprop?

Backprop exists to solve one very specific problem in machine learning: **the learning part**.
You can build a neural network that does all sorts of impressive math, but at the end of the day it needs an answer to a simple question:
"Given that this output was wrong, what should I change so it's less wrong next time?"
Backpropagation is the mechanism that answers that question. It tells the network how to update itself.
From an implementation point of view, backprop can be broken down into four broad stages:
- Calculating loss: turning model output into a single number that tells us how badly we did.
- Calculating the gradient for a single operation: figuring out how sensitive an output is to its inputs.
- Propagating gradients across the entire network: wiring those local gradients together across the graph.
- Updating the parameters: actually changing the weights so the model improves.
If you zoom way out, all of this is just an application of the chain rule from calculus.
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
This says that if (z) depends on (y), and (y) depends on (x), then changes in (x) affect (z) through (y). You don't need to differentiate everything at once; you compute each piece locally and multiply the results together.
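As a tiny worked example (mine, not from the project): if z = sin(y) and y = x^2, then
\frac{dz}{dy} = \cos(y), \qquad \frac{dy}{dx} = 2x, \qquad \frac{dz}{dx} = \cos(x^2) \cdot 2x
Each factor is a purely local derivative, and the chain rule glues them together.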
Ok, follow me on this next part; we're gonna get dense.
That equation is deceptively simple, but it is doing an enormous amount of work.
In theory, the chain rule tells us that gradients compose. In practice, a neural network is not a single chain; it's a large directed graph of operations. Values branch, merge, get reused, and eventually collapse down into a single scalar loss.
A backpropagation engine is nothing more than a systematic way to apply the chain rule everywhere in that graph.
Each operation knows how to answer a local question:
"If my output changes by this much, how much should each of my inputs change?"
Backprop stitches those answers together by flowing gradients backward from the loss, one operation at a time.
Like the previous post, this one is much more about the practical engineering of backprop than the formal math. The math is here to keep us grounded, but the focus is on how this actually works when you're building it yourself.
Part 1: Calculating Loss
Loss, generally speaking, is whatever way you are grading your model. You might have an image classification model, where the loss measures how well it did at saying what is in the image you presented it. Or a language model, where you score it based on how close it gets to predicting the next token.
Do this with enough inputs, add them all up, and now you have some general idea of how well or poorly your model did. This score gets called Loss due to historical factors.

Which loss function you use depends mostly on the task at hand. For the language model example above, Cross Entropy, which compares the output as a discrete probability distribution against the true discrete distribution, is the go-to. Whereas an image generation task, like training a GAN (Generative Adversarial Network), would be more likely to use a loss like Mean Squared Error.
All of these will in the end output a number, a single number that represents your loss, and you want a single number because that is what it takes for backprop to be defined at all for a neural network. There are reasons for multi-valued losses, but that is beyond the scope of this post.
How do you calculate this loss? Well, you already have most of the tools. Mean Squared Error, for example, can be calculated as simply as:
// Mean Squared Error
// First calculate the Error
let Diffs = (Output - ExpectedOutput);
// Then square it, we don't actually care about the
// sign of the error
// It also makes it differentiable
// and larger losses get penalized more than smaller ones
let Square = Diffs.pow(2.0);
// Then take the Mean of it, so that the scale of the loss is invariant to batch size
// A larger batch would otherwise mean a larger loss,
// but by taking the mean we are sorta using the size of the batch against itself to keep it in line
let Loss = Square.Mean();
Done.
Well not really, but done for this section. Mostly what I am trying to say is that you will just pick up a lot of the tools you need for loss as you write the functions that you need for the rest of the network.
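The Cross Entropy case mentioned earlier follows the same pattern. Here is a minimal sketch for a single example, written against plain slices rather than can-t's Tensor type; the function name and the epsilon are mine, not from the project.
// Cross-entropy for one example: -sum(target * ln(predicted))
// Both slices are assumed to already be probability distributions
// of the same length.
fn cross_entropy(predicted: &[f32], target: &[f32]) -> f32 {
    -predicted
        .iter()
        .zip(target.iter())
        // tiny epsilon so ln(0) doesn't explode to negative infinity
        .map(|(&p, &t)| t * (p + 1e-9).ln())
        .sum::<f32>()
}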
Part 2: Calculating the Gradient for a single Operation
So let's start with our Loss Tensor from above.
// This one
let Loss = Square.Mean();
Now if we remember from the first post, the Loss Tensor has an associated Operation, Mean, which holds the Id of the Tensor we took the mean of.
So for the sake of backpropagation we need to calculate the gradient of that input Tensor, i.e. how sensitive is the output to a change in the input.

For the moment let us simplify what we are working with, and assume that our input Tensor has a Shape with one dimension and that Dimension is ten elements long, so Shape[10].
So when we took the Mean of it, we summed up all of those values and then divided them by 10.
\mu = \frac{1}{10} \sum_{i=1}^{10} x_i
When we calculate the gradient we are asking ourselves the question, "How much does changing one element of the input vector change the output?"
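For the mean above, every element contributed the same 1/10 share on the way forward, so each partial derivative is just a constant:
\frac{\partial \mu}{\partial x_i} = \frac{1}{10}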
What this means from a mechanical standpoint is taking the gradient of the Loss tensor (1.0 in this instance), dividing it by the total number of elements, and giving each input element an equal share, because during the forward pass each element made an equal contribution (1/n and all of that jazz).
let scaled_grad: Vec<f32> = incoming_grad.iter().map(|&g| g / n as f32).collect();
Then comes the next part, which is subtle but important:
You take the gradient you just calculated and you ADD it to the existing gradient of the input tensors.
backprop_packet.equation.add_tensor_grad(source_id, scaled_grad);
Gradients accumulate into tensors during a backwards pass, and if you just set them you would blow away any gradients calculated earlier in the same backwards pass.
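To put those two steps together in one place, here is a minimal standalone sketch of the mean backward rule. This is not can-t's exact code; it assumes the incoming gradient is the single value flowing into the Mean output and that the input had n elements.
// Backward rule for y = mean(x): every x[i] contributed 1/n of y,
// so every x[i] receives incoming_grad / n.
fn mean_backward(incoming_grad: f32, n: usize) -> Vec<f32> {
    let share = incoming_grad / n as f32;
    // The caller should ADD this into the input tensor's existing
    // gradient buffer, not overwrite it.
    vec![share; n]
}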
To go back to our DAG from earlier:
A ----> C ----> Loss
B ----^
Let's modify it a little by adding in a new Tensor (D) and having B contribute to more than a single operation:
A ------------> C ------\
B ----+---------^        \
      |                   +----> Loss
      +-------> E -------/
D --------------^
So our B tensor contributes to two Tensors, C and E. If we were to just set the gradient when we calculated A + B or B + D, whichever happened first, we would blow it away when we did the other.
Each operation has its own particular way of calculating its gradient; Sin and Cos, for example, have very well known derivatives. This also scales to Operations that take more than one input, like Add or Matrix Multiplication.
I won't spell them all out, but if you go looking for them, coding implementations for the common ones can for sure be found.
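As a taste of a two-input one: for Add, the output moves one-for-one with each input, so the incoming gradient flows through unchanged to both sides. A minimal sketch, ignoring broadcasting and again not can-t's exact code:
// Backward rule for c = a + b: dc/da = 1 and dc/db = 1, so each input
// just receives a copy of the incoming gradient.
fn add_backward(incoming_grad: &[f32]) -> (Vec<f32>, Vec<f32>) {
    (incoming_grad.to_vec(), incoming_grad.to_vec())
}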

Part 3: Calculating the Gradient across the entire network
Ok, so I was skipping over a kinda important question: how do you know the order in which to calculate the gradients, starting from your loss variable?
You need to make sure that you calculate the gradient of a Tensor before you calculate the gradients of its input Tensors.
// Before we calculate the gradients of A, and B
let A = Tensor::random();
let B = Tensor::random();
// We need to calculate the gradient of C
let C = A + B;
// And then perform some operation to create a loss
let Loss = C.Mean();
You, the human, can work this out in your head, and if I gave you a drawing of the graph you could for sure trace it back. But how does a computer know what to do?
Topological Sort
Yeah, that's it. Topological sort.
Weβre taking advantage of the fact that a neural network forms a DAG (Directed Acyclic Graph), which we talked about in the first post.
TL;DR: A DAG has two defining properties:
- It has a direction: information flows forward.
- It has no cycles: nothing ever loops back onto itself.
A Simple DAG
┌─────┐   ┌─────┐
│  A  │   │  B  │
└──┬──┘   └──┬──┘
   │         │
   └────┬────┘
        │
   ┌────▼────┐
   │    C    │
   └────┬────┘
        │
   ┌────▼────┐
   │  Loss   │
   └─────────┘
Forward pass flows top → bottom. Backward pass flows bottom → top.
Starting the Backward Pass
When you write:
loss.backward();
You're telling the system:
"Start from Loss and propagate gradients backward through the graph."
To do this correctly, we need to visit nodes in reverse dependency order. Thatβs exactly what a topological sort gives us.
What Topological Sort Guarantees
Topological sort produces an ordered list of nodes such that:
Every tensor appears before any of the tensors that were used to produce it (since we root the sort at Loss, this is the reverse of the forward order).
This ensures that when we compute a gradient for a tensor, all downstream gradients have already been computed.
Walking the Graph
We start by requesting a topological sort rooted at Loss.
node_list = [Loss]
We then add Loss's immediate ancestors to a neighbors list:
neighbors_list = [C]
Pop C and add it to the node list:
node_list = [Loss, C]
Add C's ancestors:
neighbors_list = [A, B]
Pop A (no ancestors):
node_list = [Loss, C, A]
neighbors_list = [B]
Finally, pop B:
node_list = [Loss, C, A, B]
Using the Result
We now iterate left → right through this list, computing gradients.
At each node, we compute the gradient (more precisely, the partial Jacobian) of the operation that produced that tensor and accumulate it into its parents.
For example, if Loss = mean(C), then at the Loss node we compute:
βLoss / βC
and propagate that value backward to C.
Thatβs the core idea: Topological sort guarantees that gradients are computed in the only order that makes mathematical sense.
Top sort is not all that complicated; here it is.
Backprop Topological Sort (Rust)
/// a util function to perform topological sort on the tensors of the equation for
/// the backward pass
/// # Arguments
/// 'node' - the id of the tensor we are currently visiting
/// 'visited' - set of nodes we have already visited
/// 'stack' - a stack of nodes we need to look at
fn topological_sort_util(
    &self,
    node: TensorID,
    visited: &mut HashSet<TensorID>,
    stack: &mut Vec<TensorID>,
) {
    visited.insert(node);
    // Assuming the 'dependencies' method returns all the nodes that the current node depends on
    if let Some(dependencies) = self.tensor_record.get(&node).map(|n| n.dependencies()) {
        for dep in dependencies {
            if !visited.contains(&dep) {
                self.topological_sort_util(dep, visited, stack);
            }
        }
    }
    stack.push(node);
}
/// Performs the backward pass of a value through the network
/// it only sets the grad, it does not zero it out or update the values
/// # Arguments
/// 'starting_value' - the node we are starting the backward pass on
pub fn backward(&mut self, starting_value: TensorID) {
    // Get a top sort of all of the nodes of the graph, starting with the node we
    // want to start at
    let mut visited = HashSet::new();
    let mut stack = Vec::new();
    self.topological_sort_util(starting_value, &mut visited, &mut stack);
    // Seed backprop by setting the gradient of the output node (the loss) to 1.
    // This is asking: how much does the loss change as the loss changes
    // (the definition of a gradient)? Well, 1:1.
    // That seed lets backprop apply the chain rule backwards.
    let one_grad = vec![1.0];
    self.set_tensor_grad(starting_value, one_grad);
    while let Some(node) = stack.pop() {
        // backward_for_value is a disambiguation function that will find the right
        // backwards function to run, so about here is where the code from above would run
        let _ = self.backward_for_value(node);
    }
}
And with that, the loss of your graph can now flow backwards.
Also, you can see where I set the gradient of the starting tensor to 1, and I promised I would explain that. Basically, since gradients are a measure of how sensitive an output is to its inputs, the last node in the network is perfectly sensitive to itself, so 1.0.
Next, why Operations and not Tensors? Because a Tensor can contribute to multiple Operations, and each of those Operations has its own unique gradient to calculate; keying the backward logic off the Operation keeps things pretty simple. Each Operation in can-t is a Forwards <-> Backwards pair; here is my sin operation:
impl Tensor {
    /// Applies the sin function element-wise to the tensor
    /// Returns a new tensor with the same shape where each element x is replaced with sin(x)
    pub fn sin(&self) -> Tensor {
        let result: Vec<f32> = get_equation()
            .get_data_flat_buffer(self.id)
            .iter()
            .map(|x| x.sin())
            .collect();
        let tensor = Tensor::create_tensor_data_and_shape_and_operation(
            self.shape,
            result,
            Operation::Sin(self.id),
        );
        return tensor;
    }
}
/// Handles calculating and passing back the gradient of a sin operation
/// The derivative of sin(x) is cos(x)
pub fn backward_for_sin(backprop_packet: BackproagationPacket) {
    if let Operation::Sin(source_id) = backprop_packet.operation {
        let in_gradient = backprop_packet
            .equation
            .get_grad_flat_buffer(backprop_packet.incoming_grad);
        let source_data = backprop_packet.equation.get_data_flat_buffer(source_id);
        let updated: Vec<f32> = in_gradient
            .iter()
            .zip(source_data.iter())
            .map(|(grad, &x)| grad * x.cos())
            .collect();
        backprop_packet.equation.add_tensor_grad(source_id, updated);
    } else {
        panic!("Wrong operation for backward sin");
    }
}
With that, the Operation holds all of the information the backwards pass needs to calculate the gradient.
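For what it's worth, the backward_for_value call from the backward pass earlier is conceptually just a match on that Operation. Here is a rough sketch of the idea; the real function in can-t takes a TensorID and builds the packet itself, and any variants beyond Sin are placeholders of mine.
// Illustrative dispatch only: pick the right backward rule based on which
// Operation produced the tensor. Assumes Operation is Copy, like the
// if-let in backward_for_sin above suggests.
fn run_backward_for(packet: BackproagationPacket) {
    match packet.operation {
        Operation::Sin(_) => backward_for_sin(packet),
        // ...one arm per operation the engine supports, e.g.
        // Operation::Mean(_) => backward_for_mean(packet),
        // Operation::Add(_, _) => backward_for_add(packet),
        _ => panic!("no backward rule implemented for this operation"),
    }
}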
And with all of that... well, you are almost able to train your network. Just one last piece.
Part 4: Updating the parameters of the network itself
Ok, last step: let's update those tensors.
So you have calculated all of the gradients that you need to, now what? Well you need to update the weights of the Tensors, your Learnable Parameters.
Something might have seemed off about some of this until now.
There are three types of Tensors in the network.
1. Inputs // The inputs into the network; this could be an image, a piece of text, whatever, but it still needs to be held in a tensor so that the network can consume it
2. Learnable Parameters // These are the weights of the network; when you hear about a new LLM having a billion+ parameters, this is what they are talking about
3. Ephemeral Tensors that have a gradient but nothing to learn. // There are many of these in a network, and they arise from just doing calculations, the results of which have to live somewhere before they get used
How do you know the difference? Because, like, you don't want to update the Input tensors... then your input would change and that is bad. You don't want to update the Ephemeral Tensors either; it makes no sense, they get thrown away after you are done with one forward and backward pass of the network.
So let's break this down with a concrete example.
Like in a Linear layer, you have
Tensor Input;
Tensor Weight;
Tensor Bias;
To Start
And you create a few as you calculate the layer.
// @ is the symbol for MatMul in PyTorch
Tensor InputXWeight = Input @ Weight;
Tensor InputXWeightPlusBias = InputXWeight + Bias;
These need gradients so they can be passed back, but they don't need to have their values updated; you are just going to overwrite them when you perform the next forward pass. Whereas Weight and Bias you do want to update, THAT IS THE WHOLE POINT OF THIS PROCESS.
So how do I do it? What magic do I perform to know the difference?
weight.set_requires_grad(true);
Yeah, sometimes the dumb way just works. When I create a Linear layer and allocate Weight and Bias for the first time, I mark them as needing grad. PyTorch has a similar flag.
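So constructing that Linear layer might look roughly like this. This is a sketch of the idea, not can-t's actual Linear code, and the shapes are elided.
// Only the tensors we actually want to learn get flagged for updates.
pub struct Linear {
    weight: Tensor,
    bias: Tensor,
}

impl Linear {
    pub fn new() -> Linear {
        // Shapes elided; Tensor::random() as used earlier in this post.
        let mut weight = Tensor::random();
        let mut bias = Tensor::random();
        // These two are Learnable Parameters, so the update step should touch them.
        weight.set_requires_grad(true);
        bias.set_requires_grad(true);
        Linear { weight, bias }
    }
}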
Every tensor still has its gradient calculated, but only those that have requires_grad set to true will have their values updated when we go to do that. Simplest way to do that?
Just loop over all of your InternalTensor records, and if one requires grad, add its grad to its data.
But generally it is more complicated than that.

Generally, if you do exactly that, it is gonna be a bad time. Why? Well, because like any good vector, your gradient has two parts.
1. A Direction (where it is pointing)
2. A Magnitude (how strongly it is pointing in that direction)
And you really want #1, and only a little of #2.
So really your update is something more like
data[i] = data[i] - grad[i] * learning_rate;
Then make sure you zero out the gradients before you run another input through the graph and calculate a second backwards pass.
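Put together, the basic update step looks roughly like this. The field names on InternalTensor here are guesses on my part, just to show the shape of the loop.
// One update step: nudge every learnable tensor against its gradient,
// then clear all gradients so the next backward pass starts clean.
fn sgd_step(records: &mut [InternalTensor], learning_rate: f32) {
    for record in records.iter_mut() {
        if record.requires_grad {
            for i in 0..record.data.len() {
                record.data[i] -= record.grad[i] * learning_rate;
            }
        }
        // Zero out for everyone, learnable or not.
        for g in record.grad.iter_mut() {
            *g = 0.0;
        }
    }
}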
This is the most basic approach; it is "Stochastic Gradient Descent" so long as you pair it with random input sampling for each forward -> backward run.
Other optimizer algorithms such as AdaGrad, RMSProp, and Adam exist, and they mostly work by keeping a second set of parameters that act in some way as a "velocity" for each parameter in a weight matrix.
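To make the "velocity" idea concrete, here is a minimal sketch of classic SGD with momentum, which keeps one extra buffer per parameter. This is the textbook formulation, not any particular library's API.
// SGD with momentum: `velocity` remembers the running direction of travel,
// which smooths out noisy per-batch gradients.
fn momentum_step(
    data: &mut [f32],
    grad: &[f32],
    velocity: &mut [f32],
    learning_rate: f32,
    momentum: f32, // commonly something like 0.9
) {
    for i in 0..data.len() {
        velocity[i] = momentum * velocity[i] + grad[i];
        data[i] -= learning_rate * velocity[i];
    }
}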
Whatever form you choose, do that enough times in a row and that is how you train a neural network.
Quick Recap
- You need to calculate a Loss to grade your network
- Calculate the gradient of that Loss with respect to each Operation's inputs
- Do that for every operation in the graph
- Then you update your weights based on that gradient, and then you have a training loop
Hopefully the next post won't take as much time, but it's going to be about how to think about memory in your graph.