Deriving Backpropagation from Scratch
Backpropagation is the backbone of training deep neural networks. In this post, we will unravel the mathematics behind this critical algorithm, starting with the basics and building up to the full derivation.
The Basic Idea
The goal of training a neural network is to minimize a loss function \(\mathcal{L}\), which quantifies how well the network’s predictions match the true labels. We do this by iteratively adjusting the weights of the network in the direction that reduces the loss. The backpropagation algorithm helps us find the gradient of the loss with respect to each weight, \(\frac{\partial \mathcal{L}}{\partial w}\), so that we can update the weights accordingly.
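Concretely, once we have this gradient, a gradient-descent update with learning rate \( \eta \) looks like:
$$ w \leftarrow w - \eta \frac{\partial \mathcal{L}}{\partial w} $$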
Chain Rule
The crux of backpropagation lies in the application of the chain rule from calculus. Since the loss function \(\mathcal{L}\) is a composite of many simpler functions, the chain rule lets us express its gradient as a product of the gradients of those individual functions.
For example, if the loss depends on an output \(o\), which is a function of a hidden activation \(h\), which in turn is a function of a weight \(w\), then we can write:
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial o} \frac{\partial o}{\partial h} \frac{\partial h}{\partial w} $$
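To make this concrete, here is a small numerical sanity check (not part of the derivation; the toy values for \( x \), \( y \), and \( w \) are arbitrary). It builds the composition \( h = wx \), \( o = \sigma(h) \), \( \mathcal{L} = \frac{1}{2}(y - o)^2 \), multiplies the three partial derivatives together, and compares the result with a finite-difference estimate of \( \frac{\partial \mathcal{L}}{\partial w} \):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy scalar values, chosen arbitrarily for illustration
x, y, w = 0.5, 0.7, 0.3

# Forward pass through the composition h = w*x, o = sigmoid(h)
h = w * x
o = sigmoid(h)

# Chain rule: dL/dw = dL/do * do/dh * dh/dw
dL_do = -(y - o)
do_dh = o * (1 - o)
dh_dw = x
grad_chain = dL_do * do_dh * dh_dw

# Finite-difference estimate of dL/dw for comparison
eps = 1e-6
loss = lambda w_: 0.5 * (y - sigmoid(w_ * x))**2
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)  # the two values agree to several decimal places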
Applying the Chain Rule to Neural Networks
In a neural network, layers of neurons are connected by weights, with a bias added at each layer. Each layer applies an activation function, so the final output is computed through a series of nested functions.
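For example, with one hidden layer, weight matrices \( W_1, W_2 \), biases \( b_1, b_2 \), and activation functions \( f_1, f_2 \), the output is the nested composition:
$$ o = f_2\big(W_2 \, f_1(W_1 x + b_1) + b_2\big) $$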
Step 1: Compute the Gradient of the Loss with respect to the Output
The first step is to find the gradient of the loss with respect to the output:
$$ \frac{\partial \mathcal{L}}{\partial o} $$
This will depend on the specific form of the loss function.
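For example, with the squared-error loss used later in this post, \( \mathcal{L} = \frac{1}{2}(y - o)^2 \), this gradient is
$$ \frac{\partial \mathcal{L}}{\partial o} = -(y - o) $$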
Step 2: Compute the Gradients for Hidden Layers
Next, we need the gradients at the hidden layers. This involves computing the derivatives of the activation functions and working backward through the hidden layers with the chain rule:
$$ \frac{\partial \mathcal{L}}{\partial h} = \frac{\partial \mathcal{L}}{\partial o} \frac{\partial o}{\partial h} $$
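For instance, if the output is a sigmoid unit, \( o = \sigma(w_2 h) \), as in the example below, then
$$ \frac{\partial o}{\partial h} = \sigma'(w_2 h) \cdot w_2 = o(1 - o) \, w_2 $$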
Step 3: Compute the Gradients with respect to Weights
Finally, we can find the gradient with respect to the weights:
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial h} \frac{\partial h}{\partial w} $$
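Continuing the same example, if the hidden activation is \( h = \sigma(w_1 x) \), then
$$ \frac{\partial h}{\partial w_1} = \sigma'(w_1 x) \cdot x = h(1 - h) \, x $$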
A Detailed Derivation with Example
Let’s delve into the derivation of backpropagation by considering a simple neural network with one hidden layer. We’ll use mean squared error as our loss function and sigmoid activations at both the hidden layer and the output.
Mathematical Derivation
Let’s denote:
- \( x \): input
- \( y \): true label
- \( o \): predicted output
- \( h \): hidden layer output
- \( w_1 \): weights between the input and hidden layer
- \( w_2 \): weights between the hidden layer and output
- \( \mathcal{L} = \frac{1}{2}(y - o)^2 \): mean squared error loss
The forward pass is \( h = \sigma(w_1 x) \) and \( o = \sigma(w_2 h) \), where \( \sigma \) denotes the sigmoid. We’ll compute the gradients for the weights, \( \frac{\partial \mathcal{L}}{\partial w_1} \) and \( \frac{\partial \mathcal{L}}{\partial w_2} \), using the chain rule.
For \( w_2 \):
$$ \frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial o} \frac{\partial o}{\partial w_2} = -(y - o) \cdot o(1 - o) \cdot h $$
For \( w_1 \):
$$ \frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial h} \frac{\partial h}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial o} \frac{\partial o}{\partial h} \frac{\partial h}{\partial w_1} = -(y - o) \cdot o(1 - o) \cdot w_2 \cdot h(1 - h) \cdot x $$
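These formulas are easy to check numerically. The snippet below is a quick sanity check rather than part of the derivation (the scalar values of \( x \), \( y \), \( w_1 \), and \( w_2 \) are arbitrary): it evaluates both formulas and compares them with central-difference estimates of the loss.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary scalar values for the check
x, y, w1, w2 = 0.5, 0.7, 0.2, -0.4

def loss(w1_, w2_):
    h_ = sigmoid(w1_ * x)
    o_ = sigmoid(w2_ * h_)
    return 0.5 * (y - o_)**2

# Forward pass with the nominal weights
h = sigmoid(w1 * x)
o = sigmoid(w2 * h)

# Analytic gradients from the formulas above
g_w2 = -(y - o) * o * (1 - o) * h
g_w1 = -(y - o) * o * (1 - o) * w2 * h * (1 - h) * x

# Central-difference estimates
eps = 1e-6
n_w2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)
n_w1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)

print(g_w2, n_w2)  # analytic vs numeric gradient for w2
print(g_w1, n_w1)  # analytic vs numeric gradient for w1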
Python Code with NumPy
Now, let’s translate this into code. Here’s an example of backpropagation for this neural network using Python and NumPy:
import numpy as np
# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Derivative of the sigmoid, written in terms of the sigmoid's output: s'(z) = s(z) * (1 - s(z))
def sigmoid_derivative(s):
    return s * (1 - s)
# Input and true label, shaped as (1, 1) arrays so the matrix products below work out
x = np.array([[0.5]])
y = np.array([[0.7]])
# Initialize weights: w1 maps the single input to 5 hidden units, w2 maps them to the single output
w1 = np.random.rand(1, 5)
w2 = np.random.rand(5, 1)
# Forward pass
h = sigmoid(np.dot(x, w1))
o = sigmoid(np.dot(h, w2))
# Compute loss
loss = 0.5 * (y - o)**2
# Backward pass
d_loss_o = -(y - o)                          # dL/do for the squared-error loss
delta_o = d_loss_o * sigmoid_derivative(o)   # error signal at the output (o is a sigmoid unit)
d_loss_w2 = np.dot(h.T, delta_o)             # dL/dw2, shape (5, 1)
d_loss_h = np.dot(delta_o, w2.T)             # dL/dh, propagated back through w2
delta_h = d_loss_h * sigmoid_derivative(h)   # error signal at the hidden layer
d_loss_w1 = np.dot(x.T, delta_h)             # dL/dw1, shape (1, 5)
# Update weights
learning_rate = 0.1
w2 -= learning_rate * d_loss_w2
w1 -= learning_rate * d_loss_w1
This code implements a single forward and backward pass for our example neural network. It can be extended to multiple layers and full training by embedding the forward and backward passes inside loops to iterate over epochs and training samples.
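As a rough sketch of that extension (the epoch count and the reuse of the single training pair above are arbitrary choices for illustration), the forward and backward passes can be wrapped in a training loop. This assumes x, y, w1, w2, learning_rate, sigmoid, and sigmoid_derivative are already defined as above.

for epoch in range(1000):
    # Forward pass
    h = sigmoid(np.dot(x, w1))
    o = sigmoid(np.dot(h, w2))

    # Backward pass (same steps as above)
    delta_o = -(y - o) * sigmoid_derivative(o)
    d_loss_w2 = np.dot(h.T, delta_o)
    delta_h = np.dot(delta_o, w2.T) * sigmoid_derivative(h)
    d_loss_w1 = np.dot(x.T, delta_h)

    # Gradient-descent update
    w2 -= learning_rate * d_loss_w2
    w1 -= learning_rate * d_loss_w1

# The loss should now be much smaller than after the single pass above
h = sigmoid(np.dot(x, w1))
o = sigmoid(np.dot(h, w2))
print(0.5 * (y - o)**2)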
Conclusion
Backpropagation is an elegant application of the chain rule to the structure of neural networks. By breaking down the loss function into a series of partial derivatives, we can efficiently compute the gradients needed to update the weights of the network. This fundamental algorithm enables the training of deep neural networks, allowing them to learn complex patterns and make powerful predictions.