Chain Rule - The Engine of Backpropagation
The Chain Rule is a formula for computing the derivative of a composite function: a function built by nesting one function inside another. If a function is built like a chain, the Chain Rule shows us how to differentiate it link by link.
This is arguably the most important calculus concept for Deep Learning, as the entire structure of a neural network is one massive composite function.
1. What is a Composite Function?
A composite function is one where the output of an inner function becomes the input of an outer function.
If $y$ is a function of $u$, and $u$ is a function of $x$, then $y$ is ultimately a function of $x$.
The overall composite function is $y = f(g(x))$.
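As a concrete illustration, here is a minimal Python sketch of composition; the specific functions $f$ and $g$ below are illustrative choices, not anything prescribed by the text:

```python
# A minimal sketch of a composite function in Python.
# The inner function g and outer function f are illustrative choices.
def g(x):
    return x**2 + 1   # inner function: u = g(x)

def f(u):
    return u**3       # outer function: y = f(u)

def y(x):
    return f(g(x))    # composite: y = f(g(x))

print(y(2))  # g(2) = 5, then f(5) = 125
```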
2. The Chain Rule Formula (Single Variable)
The Chain Rule states that the rate of change of $y$ with respect to $x$ is the product of the rate of change of $y$ with respect to $u$ and the rate of change of $u$ with respect to $x$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
Example
Let $y = (x^2 + 1)^3$. This can be written as $y = u^3$ where $u = x^2 + 1$.
- Find $\frac{dy}{du}$ (Outer derivative): $\frac{dy}{du} = 3u^2$
- Find $\frac{du}{dx}$ (Inner derivative): $\frac{du}{dx} = 2x$
- Apply the Chain Rule: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 3u^2 \cdot 2x$
- Substitute back: $\frac{dy}{dx} = 6x(x^2 + 1)^2$
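As a sanity check, the analytic result can be compared against a numerical approximation. This is a minimal sketch using a central finite difference; the test point and step size are arbitrary choices:

```python
# Check dy/dx = 6x(x^2 + 1)^2 against a central finite difference.
def y(x):
    return (x**2 + 1)**3

def dy_dx(x):
    return 6 * x * (x**2 + 1)**2  # result from the Chain Rule

x, h = 2.0, 1e-5
numerical = (y(x + h) - y(x - h)) / (2 * h)
print(dy_dx(x))   # 300.0
print(numerical)  # approximately 300.0
```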
3. The Chain Rule with Multiple Variables (Partial Derivatives)
In neural networks, one variable can affect the final output through multiple different paths. This requires a slightly more complex version of the Chain Rule involving partial derivatives and summation.
If $z$ is a function of $x$ and $y$, and both $x$ and $y$ are functions of $t$: $z = f(x, y)$, where $x = g(t)$ and $y = h(t)$.
The total derivative of $z$ with respect to $t$ is:

$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
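The sketch below verifies this formula for one illustrative choice of functions ($z = x \cdot y$ with $x = \cos t$ and $y = \sin t$; these are assumptions for demonstration, not taken from the text above):

```python
import math

# Multivariate Chain Rule: dz/dt = (dz/dx)(dx/dt) + (dz/dy)(dy/dt),
# demonstrated with the illustrative choice z = x*y, x = cos(t), y = sin(t).
def z_of_t(t):
    x, y = math.cos(t), math.sin(t)
    return x * y

def dz_dt(t):
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = y, x                       # partials of z = x*y
    dx_dt, dy_dt = -math.sin(t), math.cos(t)  # derivatives of each path
    return dz_dx * dx_dt + dz_dy * dy_dt      # sum over both paths

t, h = 0.7, 1e-6
numerical = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)
print(dz_dt(t))   # exact value: cos(2t)
print(numerical)  # approximately the same
```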
4. The Chain Rule and Backpropagation
Backpropagation (short for "backward propagation of errors") is the algorithm used to train neural networks. It is nothing more than the repeated application of the multivariate Chain Rule.
The Neural Network Chain
A neural network is a sequence of composite functions:

$$L = \text{Loss}\big(f_n(f_{n-1}(\cdots f_1(x) \cdots))\big)$$
The goal is to calculate how a small change in a parameter (weight $w$) in an early layer affects the final Loss $L$.
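Applying the Chain Rule along this chain (writing $f_1, \dots, f_n$ for the layer outputs, an assumed notation) expresses that sensitivity as a product of per-layer derivatives:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdots \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial w}$$

Backpropagation evaluates these factors starting from the Loss and working backward.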
- Forward Pass: Calculate the prediction $\hat{y}$ and the Loss $L$.
- Backward Pass (Backprop): Start at the end of the chain (the Loss $L$) and calculate the partial derivatives (gradients) layer by layer, multiplying them backward toward the input.
- Update: Use the final calculated gradient to update the weight via Gradient Descent (all three steps appear in the sketch below).
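The following is a minimal sketch of these three steps for a deliberately tiny network with two scalar weights; every value here (the weights `w1` and `w2`, the learning rate `lr`, the training example) is an illustrative assumption, not a prescribed architecture:

```python
# Backpropagation by hand for a tiny two-layer scalar network:
# forward:  h = w1 * x,  y_hat = w2 * h,  loss L = (y_hat - y)^2.
x, y = 1.5, 4.0       # one training example (input, target)
w1, w2 = 0.5, 0.5     # weights to learn
lr = 0.01             # learning rate for Gradient Descent

for step in range(500):
    # 1. Forward pass: compute the prediction and the loss
    h = w1 * x
    y_hat = w2 * h
    L = (y_hat - y) ** 2

    # 2. Backward pass: Chain Rule from the loss back to each weight
    dL_dyhat = 2 * (y_hat - y)   # dL/dy_hat
    dL_dw2 = dL_dyhat * h        # dL/dy_hat * dy_hat/dw2
    dL_dh  = dL_dyhat * w2       # dL/dy_hat * dy_hat/dh
    dL_dw1 = dL_dh * x           # dL/dh * dh/dw1

    # 3. Update: Gradient Descent step on each weight
    w2 -= lr * dL_dw2
    w1 -= lr * dL_dw1

print(f"final loss = {L:.6f}, prediction = {y_hat:.4f}, target = {y}")
```

Note how the gradient for the early weight `w1` reuses `dL_dh`, the gradient already computed for the later layer; this reuse of intermediate gradients is what makes backpropagation efficient.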
5. Summary of Calculus for ML
You have now covered the three foundational concepts of Calculus required for Machine Learning:
| Concept | Mathematical Tool | ML Application |
|---|---|---|
| Derivatives | $\frac{dy}{dx}$ | Measures the slope of the loss function. |
| Partial Derivatives | $\nabla L$ (the Gradient) | Identifies the direction of steepest ascent on the high-dimensional loss surface. |
| Chain Rule | $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ | Propagates the gradient backward through all layers of a neural network to calculate parameter updates. |
With the mathematical foundations of Linear Algebra and Calculus established, we are now ready to tackle the core optimization algorithm that brings these concepts together: Gradient Descent.