Understanding Backpropagation with Simple Derivatives
Backpropagation is the backbone of deep learning,
responsible for the training phase of neural networks.
Despite its importance,
it is often left out of explanations of neural networks
due to its complexity.
Many summaries of backpropagation
either assume a level of background knowledge
that is not actually necessary to understand the algorithm
or focus exclusively on implementation,
leaving you with a working program but no insight.
This article will explain backpropagation
using the bare minimum amount of knowledge required
(in this case, only basic derivatives)
and in enough depth that you can comfortably
implement your own neural networks.
For this, a few detours are needed,
starting with a refresher on neural networks.
Neural Networks
The basic neural network structure consists of a series of layers
made up of multiple neurons.
Each neuron is connected to every neuron in the previous layer by a weight,
which represents the strength of the connection.
A small neural network may look something like this:
The value of each neuron is calculated by adding up the
values of the previous layer's neurons, each multiplied by its respective weight,
plus an extra term called the bias (b).
All of this is passed through an activation function
(generally written as σ)
giving us a final result.
For example, the value of z0 can be calculated as:
z0=σ(w0y0+w1y1+w2y2+b)
Where wi is the weight of the connection between z0 and yi,
and b is the bias of z0.
The activation function is there to make sure the neural network
is not just a big system of linear equations
by adding some extra complexity.
Popular examples include max(0,x), known as ReLU, and tanh(x).
To evaluate a neural network,
you simply set the first layer's neurons to the input
and then calculate the value of each neuron layer by layer
until the last one, which is the output.
This is called forward propagation.
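To make this concrete, here is a minimal Rust sketch of forward propagation through a single layer. The function names and the weights[j][i] storage convention (one row of weights per output neuron) are illustrative choices of mine, not taken from any particular library:

```rust
// ReLU activation: max(0, x).
fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// Compute y_j = sigma(w_0j*x_0 + w_1j*x_1 + ... + b_j) for every output neuron j.
// `weights[j][i]` is assumed to hold the weight from input i to output j.
fn forward_layer(inputs: &[f64], weights: &[Vec<f64>], biases: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .zip(biases)
        .map(|(row, b)| {
            let weighted_sum: f64 = row.iter().zip(inputs).map(|(w, x)| w * x).sum();
            relu(weighted_sum + b)
        })
        .collect()
}

fn main() {
    // One layer with three inputs and a single output neuron.
    let inputs = [0.5, -1.0, 2.0];
    let weights = vec![vec![0.1, 0.2, -0.3]];
    let biases = [0.05];
    println!("{:?}", forward_layer(&inputs, &weights, &biases));
}
```

Evaluating a whole network is then just a matter of calling this once per layer, feeding each layer's output in as the next layer's input.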
The problem then becomes finding a set of weights and biases
that gives us the output we want.
To do this, we first need to define our needs explicitly
with a loss function.
A common goal for a network is to learn what to do
based on a set of examples;
that is, we know the correct results and want the network to replicate and extrapolate them.
This falls under supervised learning
and has the advantage of having an easy way of spelling out what we want.
We simply need to minimize the difference between the outputs we want and the outputs we get.
There are many functions to calculate this error,
one of the simplest being the Mean Squared Error (MSE, for short),
which is exactly what it sounds like:
MSE = ((y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)²) / n
    = (1/n) ∑_{i=1}^{n} (yi − ŷi)²
Where ŷi is the expected output and yi is the real one.
For example, if we wanted the two outputs to be [−0.5,1],
but we got [0,0.5],
we would calculate the error in these steps:
MSE = ((0 − (−0.5))² + (0.5 − 1)²) / 2
    = (0.25 + 0.25) / 2
    = 0.25
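In code, the same calculation might look like this small sketch (the function name mse is just an illustrative choice):

```rust
// Mean Squared Error between the network's outputs and the expected outputs.
fn mse(outputs: &[f64], expected: &[f64]) -> f64 {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, t)| (y - t).powi(2))
        .sum::<f64>()
        / n
}

fn main() {
    // We wanted [-0.5, 1] but got [0, 0.5]; the error should come out as 0.25.
    println!("{}", mse(&[0.0, 0.5], &[-0.5, 1.0]));
}
```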
Now, we can finally set our goal to finding a set of parameters
that minimizes the MSE.
To do this, we first need to look at the simplest case:
minimizing a function from a single parameter.
The simplest algorithm for this task is called gradient descent
and will be our starting point for machine learning.
Gradient Descent
Gradient descent is an algorithm that helps you
find the minimum of a function with a small number of evaluations,
as long as you know the function's derivative.
To understand how it works,
consider the simple function f(x)=x2,
and how its slope changes depending on where you are.
If the minimum is to the left of the x value,
the slope is positive.
If it is to its right, it is negative.
Finally, if the x value is spot on,
the slope is just 0.
This gives us reason to try subtracting the slope
from the x value to get closer to the minimum.
This is the essence of gradient descent.
An extra factor, called the learning rate (α),
is added to control the size of each step.
Each iteration of the algorithm looks like this:
x←x−α⋅f′(x)
For example, starting with x=1
and using a learning rate of α=0.3,
we can minimize f(x)=x2 using these calculations:
x←x−α⋅f′(x)
x ← 1 − 0.3⋅f′(1) = 0.4
x ← 0.4 − 0.3⋅f′(0.4) = 0.16
x ← 0.16 − 0.3⋅f′(0.16) = 0.064
As we can see, x slowly converges to the minimum at 0.
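The whole loop fits in a few lines of code. This sketch reproduces the iterations above (up to floating-point rounding), with the derivative f′(x) = 2x hard-coded:

```rust
// Gradient descent on f(x) = x^2, starting at x = 1 with a learning rate of 0.3.
fn main() {
    let alpha = 0.3;
    let mut x: f64 = 1.0;
    for step in 1..=3 {
        let slope = 2.0 * x; // f'(x) = 2x
        x -= alpha * slope;  // x <- x - alpha * f'(x)
        println!("step {step}: x = {x}");
    }
}
```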
We can get some more insight by visualizing the algorithm.
For example, this is how minimizing f(x) = x³ + 3x² looks
with a small learning rate of 0.04:
There are two notable behaviors.
First, the steeper the slope, the faster the x value moves.
Second, the algorithm does not actually converge to the minimum of the function
but rather the closest local minimum.
If the starting point were just a bit more negative,
the algorithm would end up moving to the left indefinitely.
Choosing the right learning rate is usually a process of trial and error.
The last example used an extremely small one to make the animation smoother,
making the convergence slow.
Nevertheless, you can overdo it,
as learning rates that are too high make the algorithm bounce back and forth.
For example, using α=0.45 on the last example makes it behave erratically:
Now that we know how to minimize a function,
we can set our goal to minimizing our error.
The only thing we need is the derivative of the error E
in terms of each parameter p,
which we can write as dE/dp for now.
Then, we can just update each parameter with gradient descent:
p ← p − α⋅dE/dp
This means we can reduce our problem to finding the derivative of the error
with respect to each parameter.
But, because our neural network has a large number of parameters
that affect each other in complex ways,
we first need to understand how derivatives
work with multiple variables.
Derivatives with Multiple Variables
Consider the basic derivative,
where a function is defined in terms of one variable,
and we need to calculate its derivative.
For example:
f(x) = x³
f′(x) = 3x²
First, we should stop using the function notation:
y = x³
dy/dx = 3x²
This gives us a few benefits,
starting with being able to calculate the derivative
the other way around:
x = ∛y
dx/dy = 1 / (3⋅∛(y²)) = y^(−2/3) / 3
We should take a minute to analyze what these values mean.
If you change x by a small value ϵ,
then y will be changed by ϵ⋅3x².
Similarly, if you change y by a small value ϵ,
then x will be changed by ϵ⋅y^(−2/3)/3.
We can visualize these derivatives like this:
We can now get started with more variables.
Take this simple system:
b = 3a⁴
c = sin(b)
We can see the relationship using the same visualization:
We can easily calculate that db/da = 12a³ and dc/db = cos(b),
but what about dc/da?
First of all, when working with multiple variables,
we usually change the d to a ∂.
An approach feasible for simple systems is solving ∂c/∂a
manually using the chain rule:
c = sin(b) = sin(3a⁴)
∂c/∂a = cos(3a⁴)⋅12a³ = cos(b)⋅12a³
But consider the meaning of the graph with our derivatives:
If each small change ϵ in a
causes an ϵ⋅12a³ change in b,
and each small change δ in b
causes a δ⋅cos(b) change in c,
then we just have a total (ϵ⋅12a³)⋅cos(b)
change in c.
This logic holds up in general,
resulting in a new rule for derivatives involving
three variables, a, b, and c,
defined sequentially in terms of one another:
∂c/∂a = (∂c/∂b)⋅(∂b/∂a)
Naturally, we can reuse this property multiple times.
For example, with four variables:
First, we can calculate ∂d/∂a as (∂d/∂b)⋅(∂b/∂a).
The process is then repeated for ∂d/∂b = (∂d/∂c)⋅(∂c/∂b),
yielding a final answer of ∂d/∂a = (∂d/∂c)⋅(∂c/∂b)⋅(∂b/∂a).
For more variables, the result is the same: the product of the derivatives of neighboring variables.
This is the generalized version of the chain rule.
We are almost done with multi-variable derivatives,
except for a special case.
Say you have:
b = a²
c = −7a
d = b⋅c
How do you calculate ∂d/∂a? The shape of the graph is different:
Consider each one of the paths,
(∂b/∂a)⋅(∂d/∂b) = 2a⋅c (in blue),
and (∂c/∂a)⋅(∂d/∂c) = −7⋅b (in green).
You can understand them as the effect a has over d
through b and c, respectively.
So, if a small change ϵ causes a
ϵ⋅2a⋅c change in d through b,
and that same change also causes a
ϵ⋅(−7)⋅b change in d through c,
then, the total change is simply their sum
ϵ⋅2a⋅c+ϵ⋅(−7)⋅b=ϵ⋅(2ac−7b).
The derivative is ∂d/∂a = (∂b/∂a)⋅(∂d/∂b) + (∂c/∂a)⋅(∂d/∂c) = 2ac − 7b.
So, generally, if we have a variable that affects another through different paths,
we simply add the derivatives through each path to get our result.
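If you want to convince yourself of this rule, a quick numerical check works well: nudge a by a tiny ϵ and compare the resulting change in d against 2ac − 7b. This is only an illustrative sketch, with a = 2 chosen arbitrarily:

```rust
// d as a function of a, going through the intermediate variables b and c.
fn d_of(a: f64) -> f64 {
    let b = a * a;    // b = a^2
    let c = -7.0 * a; // c = -7a
    b * c             // d = b * c
}

fn main() {
    let a = 2.0;
    let (b, c) = (a * a, -7.0 * a);
    let analytic = 2.0 * a * c - 7.0 * b; // dd/da = 2ac - 7b
    let eps = 1e-6;
    let numeric = (d_of(a + eps) - d_of(a)) / eps; // finite-difference estimate
    println!("analytic = {analytic}, numeric = {numeric}");
}
```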
With this, we are finally ready to tackle how neural networks learn.
Training Neural Networks
We now just need to apply gradient descent to each of the weights and biases
of the neural network.
Unfortunately, a problem quickly arises.
Say you wanted to calculate the derivative of the error in terms of the green weight (∂E/∂w)
to do a step of gradient descent
in this simple neural network:
It is not a trivial calculation;
there is a complex path with multiple intermediate variables (in red).
All for just one weight in one iteration of a toy neural network.
There is also no intuitive generalization for all weights
that we can simplify into a simple expression.
Imagine this, but with a million parameters.
The solution comes from the simple observation that
every parameter is connected to the error exclusively
through the neurons of the next layer.
Everything from then on forward can be treated as a black box.
This gives us reason to try to calculate the derivative of the error
in terms of every neuron of the final layer,
and use those values to calculate the derivative of the error
in terms of the second to last layer,
and so on, working backwards from the error.
This is the essence of backpropagation,
an algorithm which follows these steps:
1. Calculate the derivative of the error in terms of the output of the final layer.
2. Calculate the derivative of the error in terms of the output of the previous layer.
3. Calculate the derivative of the error in terms of the parameters of the layer.
4. Apply a step of gradient descent using those derivatives.
5. Repeat steps 2-4 for all layers, going backwards.
But before that, we need to take some time to establish the notation system.
When talking about an individual layer,
the ith input
(i.e., the activation of the ith neuron of the previous layer)
will be written as xi,
and the jth output will be written as yj.
The bias of the jth neuron will be called bj,
and the weight from xi to yj will be wij.
Notice that, in wij, the index i always refers to the input space
and the index j to the output space.
We can make a small simplification to make our jobs easier.
We currently define the outputs of one of our layers like this:
yj=σ(w0jx0+w1jx1+w2jx2+…+wijxi+bj)
We are doing two different tasks here:
the weighted sum and the activation function.
We can split them into different layers.
The first one is called a Fully Connected Layer,
and the outputs can be calculated as the sum:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
It has i⋅j weights, and j biases,
and looks like this:
The second one is an Activation Layer,
which simply applies an activation function
to its input:
yi=σ(xi)
It has a simpler structure and no trainable parameters:
Now, we can start tackling each of the steps,
and then joining them.
Derivative of the Error
Our first step will be to calculate the derivative of the error
in terms of the outputs yj of the network.
This calculation is trivial:
E = ((y0 − ŷ0)² + (y1 − ŷ1)² + … + (yj − ŷj)²) / n
∂E/∂yj = 2(yj − ŷj)/n
Note: The formula was changed to be zero-indexed.
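As a small sketch, this first step might look like the following in code (the function name error_gradient is just illustrative):

```rust
// Derivative of the MSE with respect to each output: dE/dy_j = 2(y_j - t_j)/n.
fn error_gradient(outputs: &[f64], expected: &[f64]) -> Vec<f64> {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, t)| 2.0 * (y - t) / n)
        .collect()
}

fn main() {
    // Outputs [0, 0.5] with expected outputs [0.5, 1] give [-0.5, -0.5].
    println!("{:?}", error_gradient(&[0.0, 0.5], &[0.5, 1.0]));
}
```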
Propagating Backwards the Error of a Fully Connected Layer
To allow the computation of the previous layers,
we need to calculate the derivative of the inputs ∂E/∂xi,
knowing the derivative of the outputs ∂E/∂yj.
Take a look at our formula for the output:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
We can easily compute ∂yj/∂xi to be wij.
Also, all of the input neurons are present when calculating each output neuron.
Because the input neurons are connected to the error solely through the output neurons,
we have this situation:
Summing the effect through every output neuron gives us:
∂E/∂xi = wi0⋅(∂E/∂y0) + wi1⋅(∂E/∂y1) + … = ∑j wij⋅(∂E/∂yj)
And with this, we are done:
this is the derivative of the error
in terms of the output of the previous layer.
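A minimal sketch of this backward step, assuming the same weights[j][i] layout as in the earlier snippets, could look like this:

```rust
// Propagate the error backwards through a fully connected layer:
// dE/dx_i = sum over j of w_ij * dE/dy_j.
fn backward_inputs(weights: &[Vec<f64>], output_grads: &[f64], n_inputs: usize) -> Vec<f64> {
    (0..n_inputs)
        .map(|i| {
            weights
                .iter()
                .zip(output_grads)
                .map(|(row, grad)| row[i] * grad)
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    // Two outputs, three inputs; weights[j][i] goes from input i to output j.
    let weights = vec![vec![0.5, -1.0, 0.25], vec![1.0, 0.0, -0.5]];
    let output_grads = [0.1, -0.2];
    println!("{:?}", backward_inputs(&weights, &output_grads, 3));
}
```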
Updating the Parameters of a Fully Connected Layer
We can use the same logic for calculating the derivatives of the
error in terms of the parameters.
Starting with the weights:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
The derivative ∂yj/∂wij is simply xi.
Each weight is exclusive to one input and one output,
so the relationship is very simple:
Our final calculation will be:
∂E/∂wij = (∂yj/∂wij)⋅(∂E/∂yj)
= xi⋅(∂E/∂yj)
For the biases, it is equally simple:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
Each bias is exclusive to an output neuron,
with the derivative ∂yj/∂bj being just 1.
The relationship is the same as the weight:
But our result will be even simpler:
∂E/∂bj = (∂yj/∂bj)⋅(∂E/∂yj)
= ∂E/∂yj
With both of our derivatives,
we can now calculate the new values of the parameters through gradient descent:
wij ← wij − α⋅(∂E/∂wij)
wij ← wij − α⋅xi⋅(∂E/∂yj)
And:
bj ← bj − α⋅(∂E/∂bj)
bj ← bj − α⋅(∂E/∂yj)
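In code, both updates might look like this sketch, again assuming the weights[j][i] layout and that the inputs xi were saved during the forward pass:

```rust
// Gradient-descent update for a fully connected layer:
// w_ij <- w_ij - alpha * x_i * dE/dy_j   and   b_j <- b_j - alpha * dE/dy_j.
fn update_parameters(
    weights: &mut [Vec<f64>], // weights[j][i], from input i to output j
    biases: &mut [f64],
    inputs: &[f64],       // x_i, saved during the forward pass
    output_grads: &[f64], // dE/dy_j
    alpha: f64,
) {
    for (j, grad) in output_grads.iter().enumerate() {
        for (i, x) in inputs.iter().enumerate() {
            weights[j][i] -= alpha * x * grad;
        }
        biases[j] -= alpha * grad;
    }
}

fn main() {
    let mut weights = vec![vec![0.5, -1.0], vec![1.0, 0.25]];
    let mut biases = vec![0.0, 0.0];
    update_parameters(&mut weights, &mut biases, &[1.0, 0.5], &[-0.5, -0.5], 0.1);
    println!("{:?} {:?}", weights, biases);
}
```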
Propagating Backwards the Error of an Activation Layer
For the final step of our puzzle,
we just need to calculate the derivative of the error
in terms of the input of the activation layer ∂E/∂xi,
knowing it in terms of the output ∂E/∂yi.
The formula for the Activation Layer is the simplest one yet:
yi=σ(xi)
The derivative ∂yi/∂xi is equal to σ′(xi).
The relationship to the error is straightforward:
With this, we can get our derivative:
∂E/∂xi = (∂yi/∂xi)⋅(∂E/∂yi)
= σ′(xi)⋅(∂E/∂yi)
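A possible sketch of this step, using ReLU as the activation function (a different activation only changes the derivative function):

```rust
// Derivative of ReLU: 0 for negative inputs, 1 for positive ones.
fn relu_derivative(x: f64) -> f64 {
    if x > 0.0 { 1.0 } else { 0.0 }
}

// Propagate the error backwards through an activation layer:
// dE/dx_i = sigma'(x_i) * dE/dy_i.
fn backward_activation(inputs: &[f64], output_grads: &[f64]) -> Vec<f64> {
    inputs
        .iter()
        .zip(output_grads)
        .map(|(x, grad)| relu_derivative(*x) * grad)
        .collect()
}

fn main() {
    // sigma'(-0.5) = 0 kills the first gradient; the second stays at -0.5.
    println!("{:?}", backward_activation(&[-0.5, 0.5], &[-0.5, -0.5]));
}
```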
This is the last piece of knowledge needed to implement backpropagation.
But, for clarity's sake,
let's go through a simple example part by part.
Putting It Together: A Simple Example
To understand backpropagation,
it is useful to go through a single iteration
in a small neural network,
like this one:
The dashed lines belong to the activation layers,
the values on the connections are the weights,
and for simplicity, all of the biases start at 0.
The learning rate is set to α = 0.1
and our activation function is ReLU (max(0,x)).
First, we need some data.
Assume inputs are [1,0.5,1],
and the expected output is [0.5,1].
Evaluating the network does not give us the result we want:
the output comes out as [0, 0.5] rather than [0.5, 1].
Our error is MSE = 0.25.
Now, we can update our parameters using backpropagation,
step by step.
Step 1: Derivative of the Error in Terms of the Outputs
We can just use our formula:
∂E/∂yj = 2(yj − ŷj)/n
Substituting in our values:
∂E/∂o0 = 2(0 − 0.5)/2 = −0.5
∂E/∂o1 = 2(0.5 − 1)/2 = −0.5
Step 2: Propagating Back the First Activation Layer
For this, we will need the derivative of our activation function.
In this case, the derivative of ReLU is:
σ′(x) = ReLU′(x) = {0 if x < 0; 1 if x > 0}
Now, we can use our formula:
∂E/∂xi = σ′(xi)⋅(∂E/∂yi)
And once again evaluate:
∂E/∂c0 = σ′(c0)⋅(∂E/∂o0)
= σ′(−0.5)⋅(−0.5) = 0
And:
∂E/∂c1 = σ′(c1)⋅(∂E/∂o1)
= σ′(0.5)⋅(−0.5) = −0.5
Step 3: Propagating Back the First Fully Connected Layer
Using our formula for propagating backwards through a fully connected layer,
the derivative with respect to its only input b0 is the sum over its outputs:
∂E/∂b0 = wb0c0⋅(∂E/∂c0) + wb0c1⋅(∂E/∂c1)
= (−1)⋅0 + 1⋅(−0.5) = −0.5
Step 4: Updating the Parameters of the First Fully Connected Layer
Starting with the weights:
wij ← wij − α⋅xi⋅(∂E/∂yj)
Applying it to our network, we get:
wb0c0 ← −1 − 0.1⋅0.5⋅(∂E/∂c0) = −1
wb0c1 ← 1 − 0.1⋅0.5⋅(∂E/∂c1) = 1.025
The biases are similar:
bj ← bj − α⋅(∂E/∂yj)
Substituting:
bc0 ← 0 − 0.1⋅(∂E/∂c0) = 0
bc1 ← 0 − 0.1⋅(∂E/∂c1) = 0.05
Step 5: Repeat
Now, we just need to repeat backwards:
∂E/∂a0 = σ′(0.5)⋅(−0.5) = −0.5
wi0a0 ← 0.5 − 0.1⋅1⋅(−0.5) = 0.55
wi1a0 ← 1 − 0.1⋅0.5⋅(−0.5) = 1.025
wi2a0 ← −0.5 − 0.1⋅1⋅(−0.5) = −0.45
ba0 ← 0 − 0.1⋅(−0.5) = 0.05
Calculating the derivative of the error in terms of the first layer's neurons (the inputs)
is not necessary,
because all the parameters have already been updated.
Our new parameters are
wb0c0=−1,
wb0c1=1.025,
bc0=0,
bc1=0.05,
wi0a0=0.55,
wi1a0=1.025,
wi2a0=−0.45, and
ba0=0.05.
To train a neural network,
we just need to do this many times with multiple data points.
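To see how the pieces fit together, here is a compact, self-contained sketch that trains a single fully connected layer followed by a ReLU activation layer on one data point. The sizes, starting values, and names are illustrative, and a real network would stack several layers and loop over a whole dataset rather than a single example:

```rust
fn relu(x: f64) -> f64 {
    x.max(0.0)
}

fn relu_prime(x: f64) -> f64 {
    if x > 0.0 { 1.0 } else { 0.0 }
}

fn main() {
    let alpha = 0.1;
    let inputs = vec![1.0, 0.5, 1.0];
    let expected = vec![1.0];
    // One fully connected layer with three inputs and one output (weights[j][i]).
    let mut weights = vec![vec![0.5, 1.0, -0.5]];
    let mut biases = vec![0.0];

    for epoch in 0..20 {
        // Forward pass: fully connected layer, then activation layer.
        let sums: Vec<f64> = weights
            .iter()
            .zip(&biases)
            .map(|(row, b)| row.iter().zip(&inputs).map(|(w, x)| w * x).sum::<f64>() + b)
            .collect();
        let outputs: Vec<f64> = sums.iter().map(|&s| relu(s)).collect();

        // MSE and its derivative with respect to each output.
        let n = outputs.len() as f64;
        let error: f64 = outputs
            .iter()
            .zip(&expected)
            .map(|(y, t)| (y - t).powi(2))
            .sum::<f64>()
            / n;
        let output_grads: Vec<f64> = outputs
            .iter()
            .zip(&expected)
            .map(|(y, t)| 2.0 * (y - t) / n)
            .collect();

        // Backward pass: through the activation layer, then update the parameters.
        let sum_grads: Vec<f64> = sums
            .iter()
            .zip(&output_grads)
            .map(|(s, grad)| relu_prime(*s) * grad)
            .collect();
        for (j, grad) in sum_grads.iter().enumerate() {
            for (i, x) in inputs.iter().enumerate() {
                weights[j][i] -= alpha * x * grad;
            }
            biases[j] -= alpha * grad;
        }

        println!("epoch {epoch}: error = {error:.5}");
    }
}
```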
Conclusion
Congratulations!
If you followed everything so far,
you have the knowledge necessary to implement backpropagation.
I would recommend trying to replicate the formulas without looking at the article
to better understand the process used to derive them,
which is applicable to many other areas.
If you want to check how this algorithm
translates to code,
check out my Rust implementation.
Next Steps
This article focuses on the math behind the most basic way of implementing neural networks,
so it is just a starting point for deep learning.
From here, you might be interested in diving into:
Matrices: Our algorithm can be expressed more elegantly using matrices,
with a huge boost in performance if the work is offloaded to the GPU.
Other Layers: The two layers we have covered so far are very powerful,
but other types have their own advantages.
For example, convolutional layers are widely used in image processing.
Optimizers: Gradient descent is the basis for almost all ML optimization algorithms,
but by itself it is rather inefficient.
Today, a plethora of algorithms that converge faster are available.
Regularization: Larger neural networks can suffer several flaws,
including overfitting
(a tendency to memorize the data rather than learn the fundamental pattern).
Regularization techniques like dropout help counter these problems.
Different Architectures: For some tasks, new architectures are needed.
For example, when working with sequences, recurrent neural networks (RNNs),
which can hold on to memory, are commonly used.
Unlabeled Data: Sometimes you do not know the ideal output,
so you require a new way of learning.
For example, reinforcement learning is used for learning to play games,
using rewards and punishments to steer the network towards desired behaviors.
Likewise, for compressing data into a smaller size,
the autoencoder architecture is useful.
Resources
To learn more about machine learning,
Omar Aflak's Towards Data Science article is a good resource
that explains how to derive the matrix version of the operations
and places a greater emphasis on architecture and implementation
(for example, it introduced me to the idea of splitting the activation layer).
For video content,
the Coding Lane channel
has several series with advanced concepts.
I recommend the Improving Neural Networks series,
which covers regularization and optimizers.