Understanding Backpropagation with Simple Derivatives
Backpropagation is the backbone of deep learning,
responsible for the training phase of neural networks.
Despite its importance,
it is often left out of explanations of neural networks
due to its complexity.
Many summaries of backpropagation
either assume a level of background knowledge
that is not actually necessary to understand the algorithm
or focus exclusively on implementation,
leaving you with a working program but no insight.
This article will explain backpropagation
using the bare minimum amount of knowledge required
(in this case, only basic derivatives)
and in enough depth that you can comfortably
implement your own neural networks.
For this, a few detours are needed,
starting with a refresher on neural networks.
Neural Networks
The basic neural network structure consists of a series of layers
made up of multiple neurons.
Each neuron is connected to every neuron in the previous layer by a weight,
which represents the strength of the connection.
A small neural network may look something like this:
The value of each neuron is calculated by adding up the
values of the previous layer's neurons, each multiplied by its respective weight,
plus an extra term called the bias (b).
All of this is passed through an activation function
(generally written as σ)
giving us a final result.
For example, the value of z0 can be calculated as:
z0=σ(w0y0+w1y1+w2y2+b)
Where wi is the weight of the connection between z0 and yi,
and b is the bias of z0.
The activation function is there to make sure the neural network
is not just a big system of linear equations
by adding some extra complexity.
Popular examples include max(0,x), known as ReLU, and tanh(x).
To evaluate a neural network,
you simply set the first layer's neurons to the input
and then calculate the value of each neuron layer by layer
until the last one, which is the output.
This is called forward propagation.
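To make this concrete, here is a minimal Rust sketch of forward propagation through a single layer. The function names and the weights[j][i] storage convention (one row of weights per output neuron) are illustrative choices of mine, not taken from any particular library:

```rust
// ReLU activation: max(0, x).
fn relu(x: f64) -> f64 {
    x.max(0.0)
}

// Compute y_j = sigma(w_0j*x_0 + w_1j*x_1 + ... + b_j) for every output neuron j.
// `weights[j][i]` is assumed to hold the weight from input i to output j.
fn forward_layer(inputs: &[f64], weights: &[Vec<f64>], biases: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .zip(biases)
        .map(|(row, b)| {
            let weighted_sum: f64 = row.iter().zip(inputs).map(|(w, x)| w * x).sum();
            relu(weighted_sum + b)
        })
        .collect()
}

fn main() {
    // One layer with three inputs and a single output neuron.
    let inputs = [0.5, -1.0, 2.0];
    let weights = vec![vec![0.1, 0.2, -0.3]];
    let biases = [0.05];
    println!("{:?}", forward_layer(&inputs, &weights, &biases));
}
```

Evaluating a whole network is then just a matter of calling this once per layer, feeding each layer's output in as the next layer's input.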
The problem then becomes finding a set of weights and biases
that gives us the output we want.
To do this, we first need to define our needs explicitly
with a loss function.
A common goal for a network is to learn what to do
based on a set of examples;
that is, we know the correct results and want the network to replicate and extrapolate them.
This falls under supervised learning
and has the advantage of having an easy way of spelling out what we want.
We simply need to minimize the difference between the outputs we want and the outputs we get.
There are many functions to calculate this error,
one of the simplest being the Mean Squared Error (MSE, for short),
which is exactly what it sounds like:
MSE = ((y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)²) / n
    = (1/n) ∑_{i=1}^{n} (yi − ŷi)²
Where ŷi is the expected output and yi is the real one.
For example, if we wanted the two outputs to be [−0.5,1],
but we got [0,0.5],
we would calculate the error in these steps:
MSE = ((0 − (−0.5))² + (0.5 − 1)²) / 2
    = (0.25 + 0.25) / 2
    = 0.25
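In code, the same calculation might look like this small sketch (the function name mse is just an illustrative choice):

```rust
// Mean Squared Error between the network's outputs and the expected outputs.
fn mse(outputs: &[f64], expected: &[f64]) -> f64 {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, t)| (y - t).powi(2))
        .sum::<f64>()
        / n
}

fn main() {
    // We wanted [-0.5, 1] but got [0, 0.5]; the error should come out as 0.25.
    println!("{}", mse(&[0.0, 0.5], &[-0.5, 1.0]));
}
```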
Now, we can finally set our goal to finding a set of parameters
that minimizes the MSE.
To do this, we first need to look at the simplest case:
minimizing a function from a single parameter.
The simplest algorithm for this task is called gradient descent
and will be our starting point for machine learning.
Gradient Descent
Gradient descent is an algorithm that helps you
find the minimum of a function with a small number of evaluations,
as long as you know the function's derivative.
To understand how it works,
consider the simple function f(x)=x2,
and how its slope changes depending on where you are.
If the minimum is to the left of the x value,
the slope is positive.
If it is to its right, it is negative.
Finally, if the x value is spot on,
the slope is just 0.
This gives us reason to try subtracting the slope
from the x value to get closer to the minimum.
This is the essence of gradient descent.
An extra factor, called the learning rate (α),
is added to control the size of each step.
Each iteration of the algorithm looks like this:
x←x−α⋅f′(x)
For example, starting with x=1
and using a learning rate of α=0.3,
we can minimize f(x)=x2 using these calculations:
x←x−α⋅f′(x)
x ← 1 − 0.3⋅f′(1) = 0.4
x ← 0.4 − 0.3⋅f′(0.4) = 0.16
x ← 0.16 − 0.3⋅f′(0.16) = 0.064
As we can see, x slowly converges to the minimum at 0.
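The whole loop fits in a few lines of code. This sketch reproduces the iterations above (up to floating-point rounding), with the derivative f′(x) = 2x hard-coded:

```rust
// Gradient descent on f(x) = x^2, starting at x = 1 with a learning rate of 0.3.
fn main() {
    let alpha = 0.3;
    let mut x: f64 = 1.0;
    for step in 1..=3 {
        let slope = 2.0 * x; // f'(x) = 2x
        x -= alpha * slope;  // x <- x - alpha * f'(x)
        println!("step {step}: x = {x}");
    }
}
```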
We can get some more insight by visualizing the algorithm.
For example, this is how minimizing f(x) = x³ + 3x² looks
with a small learning rate of 0.04:
There are two notable behaviors.
First, the steeper the slope, the faster the x value moves.
Second, the algorithm does not actually converge to the minimum of the function
but rather the closest local minimum.
If the starting point were just a bit more negative,
the algorithm would end up moving to the left indefinitely.
Choosing the right learning rate is usually a process of trial and error.
The last example used an extremely small one to make the animation smoother,
making the convergence slow.
Nevertheless, you can overdo it,
as learning rates that are too high make the algorithm bounce back and forth.
For example, using α=0.45 on the last example makes it behave erratically:
Now that we know how to minimize a function,
we can set our goal to minimizing our error.
The only thing we need is the derivative of the error E
in terms of each parameter p,
which we can write as dE/dp for now.
Then, we can just update each parameter with gradient descent:
p ← p − α⋅dE/dp
This means we can reduce our problem to finding the derivative of the error
with respect to each parameter.
But, because our neural network has a large number of parameters
that affect each other in complex ways,
we first need to understand how derivatives
work with multiple variables.
Derivatives with Multiple Variables
Consider the basic derivative,
where a function is defined in terms of one variable,
and we need to calculate its derivative.
For example:
f(x) = x³
f′(x) = 3x²
First, we should stop using the function notation:
y = x³
dy/dx = 3x²
This gives us a few benefits,
starting with being able to calculate the derivative
the other way around:
x = ∛y
dx/dy = 1 / (3⋅∛(y²)) = y^(−2/3) / 3
We should take a minute to analyze what these values mean.
If you change x by a small value ϵ,
then y will be changed by ϵ⋅3x².
Similarly, if you change y by a small value ϵ,
then x will be changed by ϵ⋅y^(−2/3)/3.
We can visualize these derivatives like this:
We can now get started with more variables.
Take this simple system:
b = 3a⁴
c = sin(b)
We can see the relationship using the same visualization:
We can easily calculate that db/da = 12a³ and dc/db = cos(b),
but what about dc/da?
First of all, when working with multiple variables,
we usually change the d to a ∂.
An approach feasible for simple systems is solving ∂c/∂a
manually using the chain rule:
c = sin(b) = sin(3a⁴)
∂c/∂a = cos(3a⁴)⋅12a³ = cos(b)⋅12a³
But consider the meaning of the graph with our derivatives:
If each small change ϵ in a
causes an ϵ⋅12a³ change in b,
and each small change δ in b
causes a δ⋅cos(b) change in c,
then we just have a total (ϵ⋅12a³)⋅cos(b)
change in c.
This logic holds up in general,
resulting in a new rule for derivatives involving
three variables, a, b, and c,
defined sequentially in terms of one another:
∂c/∂a = (∂c/∂b)⋅(∂b/∂a)
Naturally, we can reuse this property multiple times.
For example, with four variables:
First, we can calculate ∂d/∂a as (∂d/∂b)⋅(∂b/∂a).
The process is then repeated for ∂d/∂b = (∂d/∂c)⋅(∂c/∂b),
yielding a final answer of ∂d/∂a = (∂d/∂c)⋅(∂c/∂b)⋅(∂b/∂a).
For more variables, the result is the same: the product of the derivatives of neighboring variables.
This is the generalized version of the chain rule.
We are almost done with multi-variable derivatives,
except for a special case.
Say you have:
b = a²
c = −7a
d = b⋅c
How do you calculate ∂d/∂a? The shape of the graph is different:
Consider each one of the paths,
(∂b/∂a)⋅(∂d/∂b) = 2a⋅c (in blue),
and (∂c/∂a)⋅(∂d/∂c) = −7⋅b (in green).
You can understand them as the effect a has over d
through b and c, respectively.
So, if a small change ϵ causes a
ϵ⋅2a⋅c change in d through b,
and that same change also causes a
ϵ⋅(−7)⋅b change in d through c,
then, the total change is simply their sum
ϵ⋅2a⋅c+ϵ⋅(−7)⋅b=ϵ⋅(2ac−7b).
The derivative is ∂d/∂a = (∂b/∂a)⋅(∂d/∂b) + (∂c/∂a)⋅(∂d/∂c) = 2ac − 7b.
So, generally, if we have a variable that affects another through different paths,
we simply add the derivatives through each path to get our result.
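If you want to convince yourself of this rule, a quick numerical check works well: nudge a by a tiny ϵ and compare the resulting change in d against 2ac − 7b. This is only an illustrative sketch, with a = 2 chosen arbitrarily:

```rust
// d as a function of a, going through the intermediate variables b and c.
fn d_of(a: f64) -> f64 {
    let b = a * a;    // b = a^2
    let c = -7.0 * a; // c = -7a
    b * c             // d = b * c
}

fn main() {
    let a = 2.0;
    let (b, c) = (a * a, -7.0 * a);
    let analytic = 2.0 * a * c - 7.0 * b; // dd/da = 2ac - 7b
    let eps = 1e-6;
    let numeric = (d_of(a + eps) - d_of(a)) / eps; // finite-difference estimate
    println!("analytic = {analytic}, numeric = {numeric}");
}
```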
With this, we are finally ready to tackle how neural networks learn.
Training Neural Networks
We now just need to apply gradient descent to each of the weights and biases
of the neural network.
Unfortunately, a problem quickly arises.
Say you wanted to calculate the derivative of the error in terms of the green weight (∂E/∂w)
to do a step of gradient descent
in this simple neural network:
It is not a trivial calculation;
there is a complex path with multiple intermediate variables (in red).
All for just one weight in one iteration of a toy neural network.
There is also no intuitive generalization for all weights
that we can simplify into a simple expression.
Imagine this, but with a million parameters.
The solution comes from the simple observation that
every parameter is connected to the error exclusively
through the neurons of the next layer.
Everything from then on forward can be treated as a black box.
This gives us reason to try to calculate the derivative of the error
in terms of every neuron of the final layer,
and use those values to calculate the derivative of the error
in terms of the second to last layer,
and so on, working backwards from the error.
This is the essence of backpropagation,
an algorithm which follows these steps:
1. Calculate the derivative of the error in terms of the output of the final layer.
2. Calculate the derivative of the error in terms of the output of the previous layer.
3. Calculate the derivative of the error in terms of the parameters of the layer.
4. Apply a step of gradient descent using those derivatives.
5. Repeat steps 2-4 for all layers, going backwards.
But before that, we need to take some time to establish the notation system.
When talking about an individual layer,
the ith input
(i.e., the activation of the ith neuron of the previous layer)
will be written as xi,
and the jth output will be written as yj.
The bias of the jth neuron will be called bj,
and the weight from xi to yj will be wij.
Notice that, in wij, the index i always refers to the input space
and the index j to the output space.
We can make a small simplification to make our jobs easier.
We currently define the outputs of one of our layers like this:
yj=σ(w0jx0+w1jx1+w2jx2+…+wijxi+bj)
We are doing two different tasks here:
the weighted sum and the activation function.
We can split them into different layers.
The first one is called a Fully Connected Layer,
and the outputs can be calculated as the sum:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
It has i⋅j weights, and j biases,
and looks like this:
The second one is an Activation Layer,
which simply applies an activation function
to its input:
yi=σ(xi)
It has a simpler structure and no trainable parameters:
Now, we can start tackling each of the steps,
and then joining them.
Derivative of the Error
Our first step will be to calculate the derivative of the error
in terms of the outputs yj of the network.
This calculation is trivial:
E = ((y0 − ŷ0)² + (y1 − ŷ1)² + … + (yj − ŷj)²) / n
∂E/∂yj = 2(yj − ŷj)/n
Note: The formula was changed to be zero-indexed.
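As a small sketch, this first step might look like the following in code (the function name error_gradient is just illustrative):

```rust
// Derivative of the MSE with respect to each output: dE/dy_j = 2(y_j - t_j)/n.
fn error_gradient(outputs: &[f64], expected: &[f64]) -> Vec<f64> {
    let n = outputs.len() as f64;
    outputs
        .iter()
        .zip(expected)
        .map(|(y, t)| 2.0 * (y - t) / n)
        .collect()
}

fn main() {
    // Outputs [0, 0.5] with expected outputs [0.5, 1] give [-0.5, -0.5].
    println!("{:?}", error_gradient(&[0.0, 0.5], &[0.5, 1.0]));
}
```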
Propagating Backwards the Error of a Fully Connected Layer
To allow the computation of the previous layers,
we need to calculate the derivative of the inputs ∂E/∂xi,
knowing the derivative of the outputs ∂E/∂yj.
Take a look at our formula for the output:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
We can easily compute ∂yj/∂xi to be wij.
Also, all of the input neurons are present when calculating each output neuron.
Because the input neurons are connected to the error solely through the output neurons,
we have this situation:
Summing the effect through every output neuron gives us:
∂E/∂xi = wi0⋅(∂E/∂y0) + wi1⋅(∂E/∂y1) + … = ∑j wij⋅(∂E/∂yj)
And with this, we are done:
this is the derivative of the error
in terms of the output of the previous layer.
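A minimal sketch of this backward step, assuming the same weights[j][i] layout as in the earlier snippets, could look like this:

```rust
// Propagate the error backwards through a fully connected layer:
// dE/dx_i = sum over j of w_ij * dE/dy_j.
fn backward_inputs(weights: &[Vec<f64>], output_grads: &[f64], n_inputs: usize) -> Vec<f64> {
    (0..n_inputs)
        .map(|i| {
            weights
                .iter()
                .zip(output_grads)
                .map(|(row, grad)| row[i] * grad)
                .sum::<f64>()
        })
        .collect()
}

fn main() {
    // Two outputs, three inputs; weights[j][i] goes from input i to output j.
    let weights = vec![vec![0.5, -1.0, 0.25], vec![1.0, 0.0, -0.5]];
    let output_grads = [0.1, -0.2];
    println!("{:?}", backward_inputs(&weights, &output_grads, 3));
}
```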
Updating the Parameters of a Fully Connected Layer
We can use the same logic for calculating the derivatives of the
error in terms of the parameters.
Starting with the weights:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
The derivative ∂yj/∂wij is simply xi.
Each weight is exclusive to one input and one output,
so the relationship is very simple:
Our final calculation will be:
∂E/∂wij = (∂yj/∂wij)⋅(∂E/∂yj)
= xi⋅(∂E/∂yj)
For the biases, it is equally simple:
yj=w0jx0+w1jx1+w2jx2+…+wijxi+bj
Each bias is exclusive to an output neuron,
with the derivative ∂yj/∂bj being just 1.
The relationship is the same as the weight:
But our result will be even simpler:
∂E/∂bj = (∂yj/∂bj)⋅(∂E/∂yj)
= ∂E/∂yj
With both of our derivatives,
we can now calculate the new values of the parameters through gradient descent:
wij ← wij − α⋅(∂E/∂wij)
wij ← wij − α⋅xi⋅(∂E/∂yj)
And:
bj ← bj − α⋅(∂E/∂bj)
bj ← bj − α⋅(∂E/∂yj)
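In code, both updates might look like this sketch, again assuming the weights[j][i] layout and that the inputs xi were saved during the forward pass:

```rust
// Gradient-descent update for a fully connected layer:
// w_ij <- w_ij - alpha * x_i * dE/dy_j   and   b_j <- b_j - alpha * dE/dy_j.
fn update_parameters(
    weights: &mut [Vec<f64>], // weights[j][i], from input i to output j
    biases: &mut [f64],
    inputs: &[f64],       // x_i, saved during the forward pass
    output_grads: &[f64], // dE/dy_j
    alpha: f64,
) {
    for (j, grad) in output_grads.iter().enumerate() {
        for (i, x) in inputs.iter().enumerate() {
            weights[j][i] -= alpha * x * grad;
        }
        biases[j] -= alpha * grad;
    }
}

fn main() {
    let mut weights = vec![vec![0.5, -1.0], vec![1.0, 0.25]];
    let mut biases = vec![0.0, 0.0];
    update_parameters(&mut weights, &mut biases, &[1.0, 0.5], &[-0.5, -0.5], 0.1);
    println!("{:?} {:?}", weights, biases);
}
```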
Propagating Backwards the Error of an Activation Layer
For the final step of our puzzle,
we just need to calculate the derivative of the error
in terms of the input of the activation layer ∂E/∂xi,
knowing it in terms of the output ∂E/∂yi.
The formula for the Activation Layer is the simplest one yet:
yi=σ(xi)
The derivative ∂yi/∂xi is equal to σ′(xi).
The relationship to the error is straightforward:
With this, we can get our derivative:
∂E/∂xi = (∂yi/∂xi)⋅(∂E/∂yi)
= σ′(xi)⋅(∂E/∂yi)
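A possible sketch of this step, using ReLU as the activation function (a different activation only changes the derivative function):

```rust
// Derivative of ReLU: 0 for negative inputs, 1 for positive ones.
fn relu_derivative(x: f64) -> f64 {
    if x > 0.0 { 1.0 } else { 0.0 }
}

// Propagate the error backwards through an activation layer:
// dE/dx_i = sigma'(x_i) * dE/dy_i.
fn backward_activation(inputs: &[f64], output_grads: &[f64]) -> Vec<f64> {
    inputs
        .iter()
        .zip(output_grads)
        .map(|(x, grad)| relu_derivative(*x) * grad)
        .collect()
}

fn main() {
    // sigma'(-0.5) = 0 kills the first gradient; the second stays at -0.5.
    println!("{:?}", backward_activation(&[-0.5, 0.5], &[-0.5, -0.5]));
}
```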
This is the last piece of knowledge needed to implement backpropagation.
But, for clarity's sake,
let's go through a simple example part by part.
Putting It Together: A Simple Example
To understand backpropagation,
it is useful to go through a single iteration
in a small neural network,
like this one:
The dashed lines belong to the activation layers,
the values on the connections are the weights,
and for simplicity, all of the biases start at 0.
The learning rate is set to α = 0.1
and our activation function is ReLU (max(0,x)).
First, we need some data.
Assume inputs are [1,0.5,1],
and the expected output is [0.5,1].
Evaluating the network does not give us the result we want:
the output comes out as [0, 0.5] rather than [0.5, 1].
Our error is MSE = 0.25.
Now, we can update our parameters using backpropagation,
step by step.
Step 1: Derivative of the Error in Terms of the Outputs
We can just use our formula:
∂E/∂yj = 2(yj − ŷj)/n
Substituting in our values:
∂E/∂o0 = 2(0 − 0.5)/2 = −0.5
∂E/∂o1 = 2(0.5 − 1)/2 = −0.5
Step 2: Propagating Back the First Activation Layer
For this, we will need the derivative of our activation function.
In this case, the derivative of ReLU is:
σ′(x) = ReLU′(x) = {0 if x < 0; 1 if x > 0}
Now, we can use our formula:
∂E/∂xi = σ′(xi)⋅(∂E/∂yi)
And once again evaluate:
∂E/∂c0 = σ′(c0)⋅(∂E/∂o0)
= σ′(−0.5)⋅(−0.5) = 0
And:
∂E/∂c1 = σ′(c1)⋅(∂E/∂o1)
= σ′(0.5)⋅(−0.5) = −0.5
Step 3: Propagating Back the First Fully Connected Layer
Using our formula for propagating backwards through a fully connected layer,
the derivative with respect to its only input b0 is the sum over its outputs:
∂E/∂b0 = wb0c0⋅(∂E/∂c0) + wb0c1⋅(∂E/∂c1)
= (−1)⋅0 + 1⋅(−0.5) = −0.5
Step 4: Updating the Parameters of the First Fully Connected Layer
Starting with the weights:
wij ← wij − α⋅xi⋅(∂E/∂yj)
Applying it to our network, we get:
wb0c0 ← −1 − 0.1⋅0.5⋅(∂E/∂c0) = −1
wb0c1 ← 1 − 0.1⋅0.5⋅(∂E/∂c1) = 1.025
The biases are similar:
bj ← bj − α⋅(∂E/∂yj)
Substituting:
bc0 ← 0 − 0.1⋅(∂E/∂c0) = 0
bc1 ← 0 − 0.1⋅(∂E/∂c1) = 0.05
Step 5: Repeat
Now, we just need to repeat backwards:
∂E/∂a0 = σ′(0.5)⋅(−0.5) = −0.5
wi0a0 ← 0.5 − 0.1⋅1⋅(−0.5) = 0.55
wi1a0 ← 1 − 0.1⋅0.5⋅(−0.5) = 1.025
wi2a0 ← −0.5 − 0.1⋅1⋅(−0.5) = −0.45
ba0 ← 0 − 0.1⋅(−0.5) = 0.05
Calculating the derivative of the error in terms of the first layer's neurons (the inputs)
is not necessary,
because all the parameters have already been updated.
Our new parameters are
wb0c0=−1,
wb0c1=1.025,
bc0=0,
bc1=0.05,
wi0a0=0.55,
wi1a0=1.025,
wi2a0=−0.45, and
ba0=0.05.
To train a neural network,
we just need to do this many times with multiple data points.
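To see how the pieces fit together, here is a compact, self-contained sketch that trains a single fully connected layer followed by a ReLU activation layer on one data point. The sizes, starting values, and names are illustrative, and a real network would stack several layers and loop over a whole dataset rather than a single example:

```rust
fn relu(x: f64) -> f64 {
    x.max(0.0)
}

fn relu_prime(x: f64) -> f64 {
    if x > 0.0 { 1.0 } else { 0.0 }
}

fn main() {
    let alpha = 0.1;
    let inputs = vec![1.0, 0.5, 1.0];
    let expected = vec![1.0];
    // One fully connected layer with three inputs and one output (weights[j][i]).
    let mut weights = vec![vec![0.5, 1.0, -0.5]];
    let mut biases = vec![0.0];

    for epoch in 0..20 {
        // Forward pass: fully connected layer, then activation layer.
        let sums: Vec<f64> = weights
            .iter()
            .zip(&biases)
            .map(|(row, b)| row.iter().zip(&inputs).map(|(w, x)| w * x).sum::<f64>() + b)
            .collect();
        let outputs: Vec<f64> = sums.iter().map(|&s| relu(s)).collect();

        // MSE and its derivative with respect to each output.
        let n = outputs.len() as f64;
        let error: f64 = outputs
            .iter()
            .zip(&expected)
            .map(|(y, t)| (y - t).powi(2))
            .sum::<f64>()
            / n;
        let output_grads: Vec<f64> = outputs
            .iter()
            .zip(&expected)
            .map(|(y, t)| 2.0 * (y - t) / n)
            .collect();

        // Backward pass: through the activation layer, then update the parameters.
        let sum_grads: Vec<f64> = sums
            .iter()
            .zip(&output_grads)
            .map(|(s, grad)| relu_prime(*s) * grad)
            .collect();
        for (j, grad) in sum_grads.iter().enumerate() {
            for (i, x) in inputs.iter().enumerate() {
                weights[j][i] -= alpha * x * grad;
            }
            biases[j] -= alpha * grad;
        }

        println!("epoch {epoch}: error = {error:.5}");
    }
}
```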
Conclusion
Congratulations!
If you followed everything so far,
you have the knowledge necessary to implement backpropagation.
I would recommend trying to replicate the formulas without looking at the article
to better understand the process used to derive them,
which is applicable to many other areas.
If you want to check how this algorithm
translates to code,
check out my Rust implementation.
Next Steps
This article focuses on the math behind the most basic way of implementing neural networks,
so it is just a starting point for deep learning.
From here, you might be interested in diving into:
Matrices: Our algorithm can be expressed more elegantly using matrices,
with a huge boost in performance if the work is offloaded to the GPU.
Other Layers: The two layers we have covered so far are very powerful,
but other types have their own advantages.
For example, convolutional layers are widely used in image processing.
Optimizers: Gradient descent is the basis for almost all ML optimization algorithms,
but by itself it is rather inefficient.
Today, a plethora of algorithms that converge faster are available.
Regularization: Larger neural networks can suffer several flaws,
including overfitting
(a tendency to memorize the data rather than learn the fundamental pattern).
Regularization techniques like dropout help counter these problems.
Different Architectures: For some tasks, new architectures are needed.
For example, when working with sequences, recurrent neural networks (RNNs),
which can hold on to memory, are commonly used.
Unlabeled Data: Sometimes you do not know the ideal output,
so you require a new way of learning.
For example, reinforcement learning is used for learning to play games,
using rewards and punishments to steer the network towards desired behaviors.
Likewise, for compressing data into a smaller size,
the autoencoder architecture is useful.
Resources
To learn more about machine learning,
Omar Aflak's Towards Data Science article is a good resource
that explains how to derive the matrix version of the operations
and places a greater emphasis on architecture and implementation
(for example, it introduced me to the idea of splitting the activation layer).
For video content,
the Coding Lane channel
has several series with advanced concepts.
I recommend the Improving Neural Networks series,
which covers regularization and optimizers.