
Wed Jan 25 2023

Understanding Backpropagation with Simple Derivatives

Backpropagation is the backbone of deep learning, responsible for the training phase of neural networks. Despite its importance, it is often left out of explanations of neural networks because of its complexity. Many summaries of backpropagation either assume background knowledge that is not actually necessary for understanding the algorithm, or focus purely on implementation, leaving the reader with a working program but no insight.

This article will explain backpropagation using the bare minimum of prerequisites (in this case, only basic derivatives), and in enough depth that you can comfortably implement your own neural networks. To get there, a few detours are needed, starting with a refresher on neural networks.

Neural Networks

The basic neural network structure consists of a series of layers made up of multiple neurons. Each neuron is connected to every neuron in the previous layer by a weight, which represents the strength of the connection. A small neural network may look something like this:

Neural net

The value of each neuron is calculated by adding up the values of the previous layer's neurons multiplied by their respective weights, plus an extra term called the bias ($b$). All of this is passed through an activation function (generally written as $\sigma$), giving us the final result. For example, the value of $z_0$ can be calculated as:

$$z_0 = \sigma(w_0 y_0 + w_1 y_1 + w_2 y_2 + b)$$

Where $w_i$ is the weight of the connection between $z_0$ and $y_i$, and $b$ is the bias of $z_0$.

The activation function is there to make sure the neural network is not just a big system of linear equations, by adding some non-linearity. Popular examples include $\max(0, x)$, known as ReLU, and $\tanh(x)$.
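To make this concrete, here is a minimal sketch of a single neuron's value in Rust, assuming ReLU as $\sigma$; the inputs, weights, and bias are made-up numbers:

```rust
// A minimal sketch of computing one neuron's value, using ReLU as the
// activation function. The inputs, weights, and bias are made-up numbers.
fn relu(x: f64) -> f64 {
    x.max(0.0)
}

fn neuron_value(inputs: &[f64], weights: &[f64], bias: f64) -> f64 {
    // Weighted sum of the previous layer's activations, plus the bias.
    let weighted_sum: f64 = inputs
        .iter()
        .zip(weights)
        .map(|(x, w)| x * w)
        .sum::<f64>()
        + bias;
    relu(weighted_sum)
}

fn main() {
    let y = [0.5, -1.0, 2.0]; // activations of the previous layer
    let w = [0.1, 0.4, -0.2]; // weights connecting them to z_0
    let b = 0.3; // bias of z_0
    println!("z_0 = {}", neuron_value(&y, &w, b));
}
```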

To evaluate a neural network, you simply set the first layer's neurons to the input and then calculate the value of each neuron layer by layer until the last one, which is the output. This is called forward propagation.

The problem then becomes finding a set of weights and biases that gives us the output we want. To do this, we first need to define our needs explicitly with a loss function. A common goal for a network is to learn what to do based on a set of examples; that is, we know the correct results and want the network to replicate and extrapolate them. This falls under supervised learning and has the advantage of having an easy way of spelling out what we want: we simply need to minimize the difference between the outputs we want and the outputs we get. There are many functions to calculate this error, one of the simplest being the Mean Squared Error (MSE for short), which is exactly what it sounds like:

$$\mathrm{MSE} = \frac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \ldots + (y_n - \hat{y}_n)^2}{n} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where $\hat{y}_i$ is the expected output and $y_i$ is the real one.

For example, if we wanted the two outputs to be $[-0.5, 1]$, but we got $[0, 0.5]$, we would calculate the error in these steps:

$$\mathrm{MSE} = \frac{(0 - (-0.5))^2 + (0.5 - 1)^2}{2} = \frac{0.25 + 0.25}{2} = 0.25$$
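The same calculation is easy to express in code; a small sketch that reproduces the example above (the function name `mse` is just for this example):

```rust
// Mean Squared Error: average of the squared differences between the
// outputs we got and the outputs we expected.
fn mse(outputs: &[f64], expected: &[f64]) -> f64 {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, y_hat)| (y - y_hat).powi(2))
        .sum::<f64>()
        / n
}

fn main() {
    // Reproduces the worked example: outputs [0, 0.5], expected [-0.5, 1].
    println!("{}", mse(&[0.0, 0.5], &[-0.5, 1.0])); // 0.25
}
```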

Now, we can finally set our goal to finding a set of parameters that minimizes the MSE. To do this, we first need to look at the simplest case: minimizing a function of a single parameter. The simplest algorithm for this task is called gradient descent and will be our starting point for machine learning.

Gradient Descent

Gradient descent is an algorithm that helps you find the minimum of a function with a small number of evaluations, as long as you know the function's derivative. To understand how it works, consider the simple function $f(x) = x^2$, and how its slope changes depending on where you are.

Function slopes

If the minimum is to the left of the $x$ value, the slope is positive. If it is to the right, the slope is negative. Finally, if the $x$ value is spot on, the slope is just $0$. This gives us reason to try subtracting the slope from the $x$ value to get closer to the minimum. This is the essence of gradient descent. An extra factor, called the learning rate ($\alpha$), is added to control the size of each step. Each iteration of the algorithm looks like this:

$$x \leftarrow x - \alpha \cdot f'(x)$$

For example, starting with $x = 1$ and using a learning rate of $\alpha = 0.3$, we can minimize $f(x) = x^2$ using these calculations:

$$x \leftarrow x - \alpha \cdot f'(x)$$
$$x \leftarrow 1 - 0.3 \cdot f'(1) = 0.4$$
$$x \leftarrow 0.4 - 0.3 \cdot f'(0.4) = 0.16$$
$$x \leftarrow 0.16 - 0.3 \cdot f'(0.16) = 0.064$$
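These iterations map directly to code; here is a minimal sketch, hard-coding the derivative $f'(x) = 2x$:

```rust
// Gradient descent on f(x) = x^2, whose derivative is f'(x) = 2x.
fn main() {
    let alpha = 0.3; // learning rate
    let df = |x: f64| 2.0 * x;

    let mut x = 1.0;
    for _ in 0..3 {
        x -= alpha * df(x);
        println!("x = {x}"); // 0.4, 0.16, 0.064
    }
}
```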

As we can see, $x$ is slowly converging to the minimum at $0$. We can get some more insight by visualizing the algorithm. For example, this is how minimizing $f(x) = x^3 + 3x^2$ looks with a small learning rate of $0.04$:

Gradient descent animation

There are two notable behaviors. First, the steeper the slope, the faster the $x$ value moves. Second, the algorithm does not actually converge to the minimum of the function but rather to the closest local minimum. If the starting point was just an inch more negative, the algorithm would just end up going to the left indefinitely.

Choosing the right learning rate is usually a process of trial and error. The last example used an extremely small one to make the animation smoother, which makes the convergence slow. Nevertheless, you can overdo it: learning rates that are too high make the algorithm bounce back and forth. For example, using $\alpha = 0.45$ on the last example makes it behave erratically:

Gradient descent with learning rate too high

Now that we know how to minimize a function, we can set our goal to minimizing our error. The only thing we need is the derivative of the error $E$ in terms of each parameter $p$, which we can write as $\frac{dE}{dp}$ for now. Then, we can just update each parameter with gradient descent:

$$p \leftarrow p - \alpha \frac{dE}{dp}$$

This means we can reduce our problem to finding the derivative of the error with respect to each parameter. But, because our neural network has a large number of parameters that affect each other in complex ways, we first need to understand how derivatives work with multiple variables.

Derivatives with Multiple Variables

Consider the basic derivative, where a function is defined in terms of one variable, and we need to calculate its derivative. For example:

$$f(x) = x^3$$
$$f'(x) = 3x^2$$

First, we should stop using the function notation:

$$y = x^3$$
$$\frac{dy}{dx} = 3x^2$$

This gives us a few benefits, starting with being able to calculate the derivative the other way around:

$$x = \sqrt[3]{y}$$
$$\frac{dx}{dy} = \frac{1}{3\sqrt[3]{y^2}} = \frac{y^{-2/3}}{3}$$

We should take a minute to analyze what these values mean. If you change $x$ by a small value $\epsilon$, then $y$ will change by approximately $\epsilon \cdot 3x^2$. Similarly, if you change $y$ by a small value $\epsilon$, then $x$ will change by approximately $\epsilon \cdot y^{-2/3}/3$. We can visualize these derivatives like this:

x and y relationship

We can now get started with more variables. Take this simple system:

$$b = 3a^4$$
$$c = \sin(b)$$

We can see the relationship using the same visualization:

x and y relationship

We can easily calculate that $\frac{db}{da} = 12a^3$ and $\frac{dc}{db} = \cos(b)$, but what about $\frac{dc}{da}$? First of all, when working with multiple variables, we usually change the $d$ to a $\partial$. An approach feasible for simple systems is solving $\frac{\partial c}{\partial a}$ manually using the chain rule:

$$c = \sin(b) = \sin(3a^4)$$
$$\frac{\partial c}{\partial a} = \cos(3a^4) \cdot 12a^3 = \cos(b) \cdot 12a^3$$

But consider the meaning of the graph with our derivatives:

a, b, and c relationship

If each small change $\epsilon$ in $a$ causes an $\epsilon \cdot 12a^3$ change in $b$, and each small change $\delta$ in $b$ causes a $\delta \cdot \cos(b)$ change in $c$, then we just have a total $(\epsilon \cdot 12a^3) \cdot \cos(b)$ change in $c$. This logic holds up in general, resulting in a new rule for derivatives involving three variables, $a$, $b$, and $c$, defined sequentially in terms of one another:

$$\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$$
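If you want to convince yourself the rule works, here is a small numerical sanity check, comparing the analytic derivative against a finite difference (the step size `eps` is arbitrary):

```rust
// Checks the chain rule for b = 3a^4, c = sin(b): the analytic derivative
// cos(b) * 12a^3 should match a finite-difference approximation.
fn main() {
    let a: f64 = 0.7;
    let c = |a: f64| (3.0 * a.powi(4)).sin();

    let b = 3.0 * a.powi(4);
    let analytic = b.cos() * 12.0 * a.powi(3);

    let eps = 1e-6;
    let numeric = (c(a + eps) - c(a - eps)) / (2.0 * eps);

    println!("analytic = {analytic}, numeric = {numeric}"); // nearly equal
}
```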

Naturally, we can reuse this property multiple times. For example, with four variables:

a, b, c and d relationship

First, we can calculate $\frac{\partial d}{\partial a}$ as $\frac{\partial d}{\partial b} \cdot \frac{\partial b}{\partial a}$. The process is then repeated for $\frac{\partial d}{\partial b} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b}$, yielding a final answer of $\frac{\partial d}{\partial a} = \frac{\partial d}{\partial c} \cdot \frac{\partial c}{\partial b} \cdot \frac{\partial b}{\partial a}$. For more variables, the result is the same: the product of the derivatives of neighboring variables. This is the generalized version of the chain rule. We are almost done with multi-variable derivatives, except for a special case. Say you have:

$$b = a^2$$
$$c = -7a$$
$$d = b \cdot c$$

How do you calculate $\frac{\partial d}{\partial a}$? The shape of the graph is different:

a to b to d and a to c to d

Consider each one of the paths, $\frac{\partial b}{\partial a} \cdot \frac{\partial d}{\partial b} = 2a \cdot c$ (in blue), and $\frac{\partial c}{\partial a} \cdot \frac{\partial d}{\partial c} = -7 \cdot b$ (in green). You can understand them as the effect $a$ has over $d$ through $b$ and $c$, respectively. So, if a small change $\epsilon$ causes an $\epsilon \cdot 2a \cdot c$ change in $d$ through $b$, and that same change also causes an $\epsilon \cdot (-7) \cdot b$ change in $d$ through $c$, then the total change is simply their sum: $\epsilon \cdot 2ac + \epsilon \cdot (-7) \cdot b = \epsilon \cdot (2ac - 7b)$. The derivative is $\frac{\partial d}{\partial a} = \frac{\partial b}{\partial a} \cdot \frac{\partial d}{\partial b} + \frac{\partial c}{\partial a} \cdot \frac{\partial d}{\partial c} = 2ac - 7b$. So, generally, if we have a variable that affects another through different paths, we simply add the derivatives through each path to get our result.
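The same kind of numerical check works here too; a short sketch:

```rust
// Checks the multi-path case b = a^2, c = -7a, d = b * c.
// Summing over both paths gives 2a*c - 7b, which simplifies to -21a^2.
fn main() {
    let a: f64 = 1.5;
    let d = |a: f64| (a * a) * (-7.0 * a);

    let (b, c) = (a * a, -7.0 * a);
    let analytic = 2.0 * a * c - 7.0 * b; // sum of the two paths

    let eps = 1e-6;
    let numeric = (d(a + eps) - d(a - eps)) / (2.0 * eps);

    println!("analytic = {analytic}, numeric = {numeric}");
}
```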

With this, we are finally ready to tackle how neural networks learn.

Training Neural Networks

We now just need to apply gradient descent to each of the weights and biases of the neural network. Unfortunately, a problem quickly arises. Say you wanted to calculate the derivative of the error in terms of the green weight ($\frac{\partial E}{\partial w}$) to do a step of gradient descent in this simple neural network:

Relationship from the weight to the error

It is not a trivial calculation: there is a complex path with multiple intermediate variables (in red), all for just one weight in one iteration of a toy neural network. There is also no intuitive generalization across weights that we could simplify into a single expression. Now imagine this, but with a million parameters.

The solution comes from the simple observation that every parameter is connected to the error exclusively through the neurons of the next layer. Everything from then on forward can be treated as a black box. This gives us reason to try to calculate the derivative of the error in terms of every neuron of the final layer, and use those values to calculate the derivative of the error in terms of the second to last layer, and so on, working backwards from the error. This is the essence of backpropagation, an algorithm which follows these steps:

  1. Calculate the derivative of the error in terms of the output of the final layer.
  2. Calculate the derivative of the error in terms of the output of the previous layer.
  3. Calculate the derivative of the error in terms of the parameters of the layer.
  4. Apply a step of gradient descent using those derivatives.
  5. Repeat steps 2-4 for all layers, going backwards (a structural sketch of this loop in code follows below).
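Before the details, here is a rough sketch of how these steps might map onto code; the `Layer` trait and its method names are hypothetical, not a fixed API:

```rust
// `backward` performs steps 2-4 for one layer: given dE/d(output), it
// updates any parameters and returns dE/d(input).
trait Layer {
    fn backward(&mut self, output_gradient: &[f64], learning_rate: f64) -> Vec<f64>;
}

// Step 5: walk the layers in reverse, feeding each one the gradient computed
// by the layer after it. Step 1 (the loss derivative) happens before this call.
fn backpropagate(layers: &mut [Box<dyn Layer>], error_gradient: Vec<f64>, alpha: f64) {
    let mut gradient = error_gradient;
    for layer in layers.iter_mut().rev() {
        gradient = layer.backward(&gradient, alpha);
    }
}

// A do-nothing layer, just so the sketch compiles and runs.
struct Identity;
impl Layer for Identity {
    fn backward(&mut self, g: &[f64], _alpha: f64) -> Vec<f64> {
        g.to_vec()
    }
}

fn main() {
    let mut layers: Vec<Box<dyn Layer>> = vec![Box::new(Identity), Box::new(Identity)];
    backpropagate(&mut layers, vec![-0.5, -0.5], 0.1);
}
```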

But before that, we need to take some time to establish the notation system. When talking about an individual layer, the $i$th input (i.e., the activation of the $i$th neuron of the previous layer) will be written as $x_i$, and the $j$th output will be written as $y^j$. The bias of the $j$th neuron will be called $b^j$, and the weight from $x_i$ to $y^j$ will be $w_i^j$. Notice that the subscripts correspond to the input space and the superscripts to the output space.

We can make a small simplification to make our jobs easier. We currently define the outputs of one of our layers like this:

$$y^j = \sigma(w_0^j x_0 + w_1^j x_1 + w_2^j x_2 + \ldots + w_i^j x_i + b^j)$$

We are doing two different tasks here: the weighted sum and the activation function. We can split them into different layers.

The first one is called a Fully Connected Layer, and the outputs can be calculated as the sum:

$$y^j = w_0^j x_0 + w_1^j x_1 + w_2^j x_2 + \ldots + w_i^j x_i + b^j$$

It has $i \cdot j$ weights and $j$ biases, and looks like this:

Fully Connected Layer
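A minimal sketch of this layer's forward pass; the field names are made up, and `weights[i][j]` holds the weight from input $i$ to output $j$:

```rust
// A fully connected layer: weighted sums plus biases, no activation.
struct FullyConnected {
    weights: Vec<Vec<f64>>, // one row per input, one column per output
    biases: Vec<f64>,       // one bias per output
}

impl FullyConnected {
    fn forward(&self, inputs: &[f64]) -> Vec<f64> {
        (0..self.biases.len())
            .map(|j| {
                inputs
                    .iter()
                    .enumerate()
                    .map(|(i, x)| self.weights[i][j] * x)
                    .sum::<f64>()
                    + self.biases[j]
            })
            .collect()
    }
}

fn main() {
    let layer = FullyConnected {
        weights: vec![vec![0.5, 1.0], vec![-0.5, 0.25]], // 2 inputs, 2 outputs
        biases: vec![0.0, 0.0],
    };
    println!("{:?}", layer.forward(&[1.0, 2.0])); // [-0.5, 1.5]
}
```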

The second one is an Activation Layer, which simply applies an activation function to its input:

$$y^i = \sigma(x_i)$$

It has a simpler structure and no trainable parameters:

Activation Layer
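In code, this layer is just an element-wise map over its input; a tiny sketch, with ReLU as the example:

```rust
// An activation layer applies sigma to each input independently.
fn activation_forward(inputs: &[f64], sigma: impl Fn(f64) -> f64) -> Vec<f64> {
    inputs.iter().map(|&x| sigma(x)).collect()
}

fn main() {
    let relu = |x: f64| x.max(0.0);
    println!("{:?}", activation_forward(&[-1.0, 0.5], relu)); // [0.0, 0.5]
}
```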

Now, we can start tackling each of the steps, and then joining them.

Derivative of the Error

Our first step will be to calculate the derivative of the error in terms of the outputs $y^j$ of the network. This calculation is trivial:

$$E = \frac{(y^0 - \hat{y}^0)^2 + (y^1 - \hat{y}^1)^2 + \ldots + (y^j - \hat{y}^j)^2}{n}$$
$$\frac{\partial E}{\partial y^j} = 2\,\frac{y^j - \hat{y}^j}{n}$$

Note: The formula was changed to be zero-indexed.
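As a sketch, this derivative for every output at once (the function name is made up):

```rust
// dE/dy^j = 2 * (y^j - ŷ^j) / n, computed for the whole output vector.
fn mse_gradient(outputs: &[f64], expected: &[f64]) -> Vec<f64> {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, y_hat)| 2.0 * (y - y_hat) / n)
        .collect()
}

fn main() {
    // Matches step 1 of the worked example later on: [-0.5, -0.5].
    println!("{:?}", mse_gradient(&[0.0, 0.5], &[0.5, 1.0]));
}
```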

Propagating Backwards the Error of a Fully Connected Layer

To allow the computation of the previous layers, we need to calculate the derivative of the inputs $\frac{\partial E}{\partial x_i}$ knowing the derivative of the outputs $\frac{\partial E}{\partial y^j}$. Take a look at our formula for the output:

$$y^j = w_0^j x_0 + w_1^j x_1 + w_2^j x_2 + \ldots + w_i^j x_i + b^j$$

We can easily compute $\frac{\partial y^j}{\partial x_i}$ to be $w_i^j$. Also, every input neuron is involved in calculating each output neuron. Because the input neurons are connected to the error solely through the output neurons, we have this situation:

Variables of FCN backpropagation

We can start calculating:

$$\frac{\partial E}{\partial x_i} = \frac{\partial y^0}{\partial x_i} \cdot \frac{\partial E}{\partial y^0} + \frac{\partial y^1}{\partial x_i} \cdot \frac{\partial E}{\partial y^1} + \frac{\partial y^2}{\partial x_i} \cdot \frac{\partial E}{\partial y^2} + \ldots + \frac{\partial y^j}{\partial x_i} \cdot \frac{\partial E}{\partial y^j}$$
$$= w_i^0 \cdot \frac{\partial E}{\partial y^0} + w_i^1 \cdot \frac{\partial E}{\partial y^1} + w_i^2 \cdot \frac{\partial E}{\partial y^2} + \ldots + w_i^j \cdot \frac{\partial E}{\partial y^j}$$

And with this, we are done: we can use this as the derivative of the error in terms of the output of the previous layer.
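A sketch of this backward step in code, with the same `weights[i][j]` layout as before:

```rust
// dE/dx_i is the weighted sum of the output gradients, using the same
// weights as the forward pass.
fn fc_input_gradient(weights: &[Vec<f64>], output_gradient: &[f64]) -> Vec<f64> {
    weights
        .iter() // one row of weights per input neuron i
        .map(|row| {
            row.iter()
                .zip(output_gradient)
                .map(|(w, g)| w * g)
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    let weights = vec![vec![0.5, 1.0], vec![-0.5, 0.25]];
    println!("{:?}", fc_input_gradient(&weights, &[-0.5, -0.5]));
}
```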

Updating the Parameters of a Fully Connected Layer

We can use the same logic for calculating the derivatives of the error in terms of the parameters. Starting with the weights:

$$y^j = w_0^j x_0 + w_1^j x_1 + w_2^j x_2 + \ldots + w_i^j x_i + b^j$$

The derivative $\frac{\partial y^j}{\partial w_i^j}$ is simply $x_i$. Each weight is exclusive to one input and one output, so the relationship is very simple:

Variables in δE/δw

Our final calculation will be:

$$\frac{\partial E}{\partial w_i^j} = \frac{\partial y^j}{\partial w_i^j} \cdot \frac{\partial E}{\partial y^j} = x_i \cdot \frac{\partial E}{\partial y^j}$$

For the biases, it is equally simple:

$$y^j = w_0^j x_0 + w_1^j x_1 + w_2^j x_2 + \ldots + w_i^j x_i + b^j$$

Each bias is exclusive to one output neuron, and the derivative $\frac{\partial y^j}{\partial b^j}$ is just $1$. The relationship is the same as for the weights:

Variables in δE/δb

But our result will be even simpler:

$$\frac{\partial E}{\partial b^j} = \frac{\partial y^j}{\partial b^j} \cdot \frac{\partial E}{\partial y^j} = \frac{\partial E}{\partial y^j}$$

With both of our derivatives, we can now calculate the new values of the parameters through gradient descent:

$$w_i^j \leftarrow w_i^j - \alpha \cdot \frac{\partial E}{\partial w_i^j}$$
$$w_i^j \leftarrow w_i^j - \alpha \cdot x_i \cdot \frac{\partial E}{\partial y^j}$$

And:

$$b^j \leftarrow b^j - \alpha \cdot \frac{\partial E}{\partial b^j}$$
$$b^j \leftarrow b^j - \alpha \cdot \frac{\partial E}{\partial y^j}$$
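Both updates together in a short sketch; the numbers in `main` anticipate step 4 of the worked example later in the article:

```rust
// Gradient descent on a fully connected layer's parameters.
fn fc_update(
    weights: &mut [Vec<f64>],
    biases: &mut [f64],
    inputs: &[f64],
    output_gradient: &[f64],
    alpha: f64,
) {
    // Each weight moves by -alpha * x_i * dE/dy^j ...
    for (i, row) in weights.iter_mut().enumerate() {
        for (j, w) in row.iter_mut().enumerate() {
            *w -= alpha * inputs[i] * output_gradient[j];
        }
    }
    // ... and each bias by -alpha * dE/dy^j.
    for (j, b) in biases.iter_mut().enumerate() {
        *b -= alpha * output_gradient[j];
    }
}

fn main() {
    // One input with value 0.5, two outputs with gradients [0.0, -0.5].
    let mut weights = vec![vec![-1.0, 1.0]];
    let mut biases = vec![0.0, 0.0];
    fc_update(&mut weights, &mut biases, &[0.5], &[0.0, -0.5], 0.1);
    println!("{:?} {:?}", weights, biases); // [[-1.0, 1.025]] [0.0, 0.05]
}
```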

Propagating Backwards the Error of an Activation Layer

For the final piece of the puzzle, we just need to calculate the derivative of the error in terms of the input of the activation layer, $\frac{\partial E}{\partial x_i}$, knowing it in terms of the output, $\frac{\partial E}{\partial y^i}$. The formula for the Activation Layer is the simplest one yet:

$$y^i = \sigma(x_i)$$

The derivative $\frac{\partial y^i}{\partial x_i}$ is equal to $\sigma'(x_i)$. The relationship to the error is straightforward:

Propagating backwards an activation layer

With this, we can get our derivative:

$$\frac{\partial E}{\partial x_i} = \frac{\partial y^i}{\partial x_i} \cdot \frac{\partial E}{\partial y^i} = \sigma'(x_i) \cdot \frac{\partial E}{\partial y^i}$$
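And the corresponding sketch in code, using ReLU's derivative as the example:

```rust
// Each input's gradient is sigma'(x_i) times the matching output gradient.
fn activation_backward(
    inputs: &[f64],
    output_gradient: &[f64],
    sigma_prime: impl Fn(f64) -> f64,
) -> Vec<f64> {
    inputs
        .iter()
        .zip(output_gradient)
        .map(|(&x, g)| sigma_prime(x) * g)
        .collect()
}

fn main() {
    let relu_prime = |x: f64| if x > 0.0 { 1.0 } else { 0.0 };
    // Matches step 2 of the worked example below: [0.0, -0.5].
    println!("{:?}", activation_backward(&[-0.5, 0.5], &[-0.5, -0.5], relu_prime));
}
```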

This is the last piece of knowledge needed to implement backpropagation. But, for clarity's sake, let's go through a simple example part by part.

Putting It Together: A Simple Example

To understand backpropagation, it is useful to go through a single iteration in a small neural network, like this one:

Neural networks

The dashed lines are from the activation layers, the values on the lines are the weights, and for simplicity, all of the biases will start at $0$. The learning rate will be set to $\alpha = 0.1$ and our activation function to ReLU ($\max(0, x)$). First, we need some data. Assume the inputs are $[1, 0.5, 1]$ and the expected output is $[0.5, 1]$. Evaluating the network does not give us the result we want:

Evaluated values of neural network

The output is not quite right. Our error is $\mathrm{MSE} = 0.25$. Now, we can update our parameters using backpropagation, step by step.

Step 1: Derivative of the Error in Terms of the Outputs

We can just use our formula:

$$\frac{\partial E}{\partial y^j} = 2\,\frac{y^j - \hat{y}^j}{n}$$

Substituting in our values:

$$\frac{\partial E}{\partial o_0} = 2\,\frac{0 - 0.5}{2} = -0.5$$
$$\frac{\partial E}{\partial o_1} = 2\,\frac{0.5 - 1}{2} = -0.5$$

Step 2: Propagating Back the First Activation Layer

For this, we will need the derivative of our activation function. In this case, the derivative of ReLU is:

$$\sigma'(x) = \mathrm{ReLU}'(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x > 0 \end{cases}$$

Now, we can use our formula:

$$\frac{\partial E}{\partial x_i} = \sigma'(x_i) \cdot \frac{\partial E}{\partial y^i}$$

And once again evaluate:

$$\frac{\partial E}{\partial c_0} = \sigma'(c_0) \cdot \frac{\partial E}{\partial o_0} = \sigma'(-0.5) \cdot (-0.5) = 0$$

And:

$$\frac{\partial E}{\partial c_1} = \sigma'(c_1) \cdot \frac{\partial E}{\partial o_1} = \sigma'(0.5) \cdot (-0.5) = -0.5$$

Step 3: Propagating Back the First Fully Connected Layer

Using our formula to substitute:

$$\frac{\partial E}{\partial x_i} = w_i^0 \cdot \frac{\partial E}{\partial y^0} + w_i^1 \cdot \frac{\partial E}{\partial y^1} + w_i^2 \cdot \frac{\partial E}{\partial y^2} + \ldots + w_i^j \cdot \frac{\partial E}{\partial y^j}$$

We get:

$$\frac{\partial E}{\partial b_0} = (-1) \cdot \frac{\partial E}{\partial c_0} + 1 \cdot \frac{\partial E}{\partial c_1} = (-1) \cdot 0 + 1 \cdot (-0.5) = -0.5$$

Step 4: Updating the Parameters of the First Fully Connected Layer

Starting with the weights:

$$w_i^j \leftarrow w_i^j - \alpha \cdot x_i \cdot \frac{\partial E}{\partial y^j}$$

Applying it to our network, we get:

$$w_{b_0}^{c_0} \leftarrow -1 - 0.1 \cdot 0.5 \cdot \frac{\partial E}{\partial c_0} = -1$$
$$w_{b_0}^{c_1} \leftarrow 1 - 0.1 \cdot 0.5 \cdot \frac{\partial E}{\partial c_1} = 1.025$$

The biases are similar:

$$b^j \leftarrow b^j - \alpha \cdot \frac{\partial E}{\partial y^j}$$

Substituting:

$$b^{c_0} \leftarrow 0 - 0.1 \cdot \frac{\partial E}{\partial c_0} = 0$$
$$b^{c_1} \leftarrow 0 - 0.1 \cdot \frac{\partial E}{\partial c_1} = 0.05$$

Step 5: Repeat

Now, we just need to repeat backwards:

$$\frac{\partial E}{\partial a_0} = \sigma'(0.5) \cdot (-0.5) = -0.5$$
$$w_{i_0}^{a_0} \leftarrow 0.5 - 0.1 \cdot 1 \cdot (-0.5) = 0.55$$
$$w_{i_1}^{a_0} \leftarrow 1 - 0.1 \cdot 0.5 \cdot (-0.5) = 1.025$$
$$w_{i_2}^{a_0} \leftarrow -0.5 - 0.1 \cdot 1 \cdot (-0.5) = -0.45$$
$$b^{a_0} \leftarrow 0 - 0.1 \cdot (-0.5) = 0.05$$

Calculating the derivative of the error in terms of the first layer's inputs is not necessary, because all the parameters have already been updated. Our new parameters are $w_{b_0}^{c_0} = -1$, $w_{b_0}^{c_1} = 1.025$, $b^{c_0} = 0$, $b^{c_1} = 0.05$, $w_{i_0}^{a_0} = 0.55$, $w_{i_1}^{a_0} = 1.025$, $w_{i_2}^{a_0} = -0.45$, and $b^{a_0} = 0.05$. To train a neural network, we just need to repeat this process many times with multiple data points.

Conclusion

Congratulations! If you followed everything so far, you now have the knowledge necessary to implement backpropagation. I would recommend trying to replicate the formulas without looking at the article, to better understand the process used to derive them, which is applicable to many other areas. If you want to see how this algorithm translates to code, check out my Rust implementation.

Next Steps

This article focuses on the math of the most basic way of implementing basic neural networks, so it is just a starting point for deep learning. From here, there are many directions you might be interested in diving into.

Resources

To learn more about machine learning, Omar Aflak's Towards Data Science article is a good resource that explains how to derive the matrix version of the operations and places a greater emphasis on architecture and implementation (for example, it introduced me to the idea of splitting off the activation layer). For video content, the Coding Lane channel has several series on advanced concepts. I recommend the Improving Neural Networks series, which covers regularization and optimizers.

