Backpropagation Algorithm
Explore the backpropagation algorithm that enables neural networks to learn by computing gradients of the loss function with respect to weights and biases. Understand the forward and backward passes, how the chain rule applies to gradient calculation, and how these updates optimize the network's performance through iterative training.
Neural networks (NNs) are non-linear classifiers that can be formulated as a series of matrix multiplications followed by non-linear activations. Just like linear classifiers, they can be trained using the same principle we followed before, namely the gradient descent algorithm. The difficulty arises in computing the gradients.
But first things first.
Let’s start with a straightforward example of a two-layered NN, with each layer containing just one neuron.
Notations
- The superscript $L$ denotes the layer that we are in.
- $a^L$ denotes the activation of layer $L$.
- $w^L$ is the scalar weight of layer $L$.
- $b^L$ is the bias term of layer $L$.
- $C$ is the cost function, $y$ is our target class, and $\sigma$ is the activation function.
Forward pass
Our lovely model would look something like this in a simple sketch:
We can write the output of a neuron at layer $L$ as:

$$a^L = \sigma(w^L a^{L-1} + b^L)$$

To simplify things, let’s define:

$$z^L = w^L a^{L-1} + b^L$$

so that our basic equation will become:

$$a^L = \sigma(z^L)$$

We also know that our loss function is:

$$C = (a^L - y)^2$$
This is the so-called forward pass. We take some input and pass it through the network. From the output of the network, we can compute the loss $C$.
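The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid activation and toy values for the input, weights, and biases:

```python
import math

def sigmoid(z):
    """Assumed activation function sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    z1 = w1 * x + b1   # pre-activation z^1
    a1 = sigmoid(z1)   # activation a^1
    z2 = w2 * a1 + b2  # pre-activation z^2
    a2 = sigmoid(z2)   # network output a^L
    return a2

y = 1.0  # toy target
a_L = forward(x=0.5, w1=0.8, b1=0.1, w2=-0.4, b2=0.3)
loss = (a_L - y) ** 2  # squared-error cost C
```

Each layer applies the same two steps: an affine transformation $z^L = w^L a^{L-1} + b^L$, followed by the non-linearity $\sigma$.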
Backward pass
The backward pass is the process of adjusting the weights in all the layers to minimize the loss $C$.
To adjust the weights based on the training example, we can use our known update rule:

$$w^L \leftarrow w^L - \lambda \frac{\partial C}{\partial w^L}$$

where $\lambda$ is the learning rate that scales down the gradient.

It should be clear by now that the only thing left to compute is the gradient $\frac{\partial C}{\partial w^L}$ (the derivative of the loss with respect to the weight).
One way to think about computing $\frac{\partial C}{\partial w^L}$ is through the following diagram, which is called a computational graph:

The graph summarizes the operations performed in the forward pass. To convert this into math, we need to revisit the chain rule.
The chain rule for the backward pass
To compute the gradient $\frac{\partial C}{\partial w^L}$, our most useful tool is calculus and the chain rule. Using both, we can write:

$$\frac{\partial C}{\partial w^L} = \frac{\partial C}{\partial a^L} \frac{\partial a^L}{\partial z^L} \frac{\partial z^L}{\partial w^L}$$
It is evident that the final gradient is affected by the gradients of the previous neuron, which in turn are affected by the gradients of the one before it. You can see that in order to compute the gradient, we need to go back (through the chain rule) all the way to the beginning of the network.
In other terms, we need to propagate the error backwards. This is how the backpropagation algorithm got its name.
To find the gradient, let’s compute each partial derivative. By using basic calculus, we get:

$$\frac{\partial z^L}{\partial w^L} = a^{L-1}, \qquad \frac{\partial a^L}{\partial z^L} = \sigma'(z^L), \qquad \frac{\partial C}{\partial a^L} = 2(a^L - y)$$
Combining them all together, we acquire our final gradient:

$$\frac{\partial C}{\partial w^L} = 2(a^L - y)\,\sigma'(z^L)\,a^{L-1}$$
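The three factors of the chain rule can be multiplied together directly in code. The sketch below assumes a sigmoid activation (whose derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$) and toy values for $a^{L-1}$, $w^L$, $b^L$, and $y$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed toy values for the last layer.
a_prev = 0.6            # a^{L-1}
w, b, y = 0.5, 0.1, 1.0

z = w * a_prev + b      # z^L
a = sigmoid(z)          # a^L

# The three factors of the chain rule:
dC_da = 2.0 * (a - y)   # dC/da^L
da_dz = sigmoid_prime(z)  # da^L/dz^L
dz_dw = a_prev          # dz^L/dw^L

dC_dw = dC_da * da_dz * dz_dw  # dC/dw^L
```

A quick sanity check is to compare `dC_dw` against a finite-difference estimate of the same derivative; the two should agree to several decimal places.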
Similar equations can be derived for the biases. Instead of $\frac{\partial z^L}{\partial w^L}$, we would have:

$$\frac{\partial z^L}{\partial b^L} = 1$$

For completion, if we do the math, we get:

$$\frac{\partial C}{\partial b^L} = 2(a^L - y)\,\sigma'(z^L)$$
Now, we can adjust the weight and bias from a single training example using the update rule:

$$w^L \leftarrow w^L - \lambda \frac{\partial C}{\partial w^L}, \qquad b^L \leftarrow b^L - \lambda \frac{\partial C}{\partial b^L}$$
Next, we’ll feed in the next example, readjust the weights and biases, and repeat. This is the famous backpropagation algorithm.
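Putting the forward pass, backward pass, and update rule together gives the full training loop. The sketch below trains the two-layer, one-neuron-per-layer network on a single assumed example, with a sigmoid activation and an assumed learning rate of 0.5:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed toy setup: one input, one target, learning rate 0.5.
x, y, lr = 0.5, 1.0, 0.5
w1, b1, w2, b2 = 0.8, 0.1, -0.4, 0.3

for _ in range(1000):
    # Forward pass.
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)

    # Backward pass: chain rule, starting from the loss.
    dC_dz2 = 2.0 * (a2 - y) * sigmoid_prime(z2)
    dC_dw2, dC_db2 = dC_dz2 * a1, dC_dz2
    dC_da1 = dC_dz2 * w2  # propagate the error backwards
    dC_dz1 = dC_da1 * sigmoid_prime(z1)
    dC_dw1, dC_db1 = dC_dz1 * x, dC_dz1

    # Update rule for every weight and bias.
    w2 -= lr * dC_dw2
    b2 -= lr * dC_db2
    w1 -= lr * dC_dw1
    b1 -= lr * dC_db1

loss = (sigmoid(w2 * sigmoid(w1 * x + b1) + b2) - y) ** 2
```

After a few hundred iterations, the loss on this single example shrinks close to zero, which is exactly the behavior the update rule promises.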
You might argue that this is oversimplified because we only have one neuron per layer. To be honest, not much will change if we add more neurons per layer. We will essentially arrive at the same equations.
Here, $j$ is a neuron in the layer $L$, while $k$ is a neuron in the layer $L-1$.

And if we want to present the derivative in its final form, we have:

$$\frac{\partial C}{\partial a_k^{L-1}} = \sum_j w_{jk}^L \, \sigma'(z_j^L) \, \frac{\partial C}{\partial a_j^L}$$

where:

$$z_j^L = \sum_k w_{jk}^L a_k^{L-1} + b_j^L$$
Two final things to note here:
- The derivative with respect to the activation is a summation because the activation of a neuron in layer $L-1$ feeds into all the neurons of layer $L$, so the gradient collects a contribution from each of them.
- The same derivative also depends on the derivatives of the next layer’s activation (backpropagation of the error).
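Both points can be seen concretely in code. The sketch below computes $\frac{\partial C}{\partial a_k^{L-1}}$ for an assumed tiny layer (two neurons in layer $L-1$, three in layer $L$), with a sigmoid activation and an assumed, given vector of upstream derivatives $\frac{\partial C}{\partial a_j^L}$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed small layer: w[j][k] connects neuron k of layer L-1
# to neuron j of layer L.
a_prev = [0.2, 0.7]                          # a_k^{L-1}
w = [[0.1, -0.3], [0.5, 0.8], [-0.2, 0.4]]   # w_jk^L
b = [0.0, 0.1, -0.1]                         # b_j^L
dC_da = [0.3, -0.1, 0.2]                     # dC/da_j^L, assumed given

# z_j^L = sum_k w_jk^L a_k^{L-1} + b_j^L
z = [sum(w[j][k] * a_prev[k] for k in range(2)) + b[j] for j in range(3)]

# dC/da_k^{L-1}: a sum over every neuron j in layer L that a_k feeds into.
dC_da_prev = [
    sum(w[j][k] * sigmoid_prime(z[j]) * dC_da[j] for j in range(3))
    for k in range(2)
]
```

The inner `sum` over `j` is the summation from the equation above, and `dC_da` is the backpropagated error arriving from the next layer.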
You now have a sense of how NNs learn, and that is no easy task.
Important note: We will not be computing gradients in every network that we define. The gradients are computed automatically in modern frameworks such as PyTorch.
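As an illustration of that automation, here is the same two-layer, one-neuron-per-layer chain written with PyTorch, using assumed toy values. A single call to `backward()` applies the whole chain rule for us:

```python
import torch

# Assumed toy values; requires_grad=True tells autograd to track them.
x = torch.tensor(0.5)
y = torch.tensor(1.0)
w1 = torch.tensor(0.8, requires_grad=True)
b1 = torch.tensor(0.1, requires_grad=True)
w2 = torch.tensor(-0.4, requires_grad=True)
b2 = torch.tensor(0.3, requires_grad=True)

# Forward pass.
a1 = torch.sigmoid(w1 * x + b1)
a2 = torch.sigmoid(w2 * a1 + b2)
C = (a2 - y) ** 2

# Backward pass: one call computes every partial derivative.
C.backward()

# w1.grad, b1.grad, w2.grad, b2.grad now hold dC/dw^L and dC/db^L.
```

The gradients that autograd produces match the hand-derived chain-rule expressions from this lesson, e.g. `w2.grad` equals $2(a^L - y)\,\sigma'(z^L)\,a^{L-1}$.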
No more partial derivatives!