Chain Rule - The Engine of Backpropagation

The Chain Rule is a formula for computing the derivative of a composite function, one built by nesting one function inside another. If a function is built like a chain, the Chain Rule shows us how to differentiate it link by link.

This is arguably the most important calculus concept for Deep Learning, as the entire structure of a neural network is one massive composite function.

1. What is a Composite Function?

A composite function is one where the output of an inner function becomes the input of an outer function.

If $y$ is a function of $u$, and $u$ is a function of $x$, then $y$ is ultimately a function of $x$.

$$y = f(u) \quad \text{where} \quad u = g(x)$$

The overall composite function is $y = f(g(x))$.
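
To see what composition looks like in code, here is a minimal sketch (the functions chosen match the worked example in the next section; the names `f`, `g`, and `composite` are just illustrative):

```python
# Composition in code: the output of the inner function g feeds the outer function f
def g(x):
    return x**2 + 1      # inner function: u = g(x)

def f(u):
    return u**3          # outer function: y = f(u)

def composite(x):
    return f(g(x))       # y = f(g(x))

print(composite(2))      # g(2) = 5, then f(5) = 125
```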

2. The Chain Rule Formula (Single Variable)

The Chain Rule states that the rate of change of $y$ with respect to $x$ is the product of the rate of change of $y$ with respect to $u$ and the rate of change of $u$ with respect to $x$.

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Example

Let $y = (x^2 + 1)^3$. This can be written as $y = u^3$ where $u = x^2 + 1$.

  1. Find $\frac{dy}{du}$ (outer derivative): $\frac{dy}{du} = \frac{d}{du}(u^3) = 3u^2$
  2. Find $\frac{du}{dx}$ (inner derivative): $\frac{du}{dx} = \frac{d}{dx}(x^2 + 1) = 2x$
  3. Apply the Chain Rule: $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = (3u^2) \cdot (2x)$
  4. Substitute $u$ back: $\frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2$ (checked numerically in the sketch after this list)
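
To make this concrete, here is a minimal Python check (the test point `x = 1.5` and the function names are illustrative choices, not part of the derivation) comparing the chain-rule result against a central-difference approximation:

```python
def f(x):
    return (x**2 + 1) ** 3          # y = (x^2 + 1)^3

def analytic_derivative(x):
    return 6 * x * (x**2 + 1) ** 2  # result from the Chain Rule above

def numerical_derivative(func, x, h=1e-6):
    # Central-difference approximation of the derivative
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.5
print(analytic_derivative(x))       # 95.0625
print(numerical_derivative(f, x))   # ≈ 95.0625, matching the analytic result
```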

3. The Chain Rule with Multiple Variables (Partial Derivatives)

In neural networks, one variable can affect the final output through multiple different paths. This requires a slightly more complex version of the Chain Rule involving partial derivatives and summation.

If $z$ is a function of $x$ and $y$, and both $x$ and $y$ are functions of $t$: $z = f(x, y)$, where $x = g(t)$ and $y = h(t)$.

The total derivative of zz with respect to tt is:

$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}$$
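
Here is a minimal sketch of the two-path formula in Python. The concrete choice $z = x \cdot y$ with $x = \cos(t)$ and $y = \sin(t)$ is an assumption made for illustration, not an example from the text:

```python
import math

def z_of_t(t):
    x, y = math.cos(t), math.sin(t)
    return x * y

def total_derivative(t):
    x, y = math.cos(t), math.sin(t)
    dz_dx, dz_dy = y, x                       # partial derivatives of z = x * y
    dx_dt, dy_dt = -math.sin(t), math.cos(t)  # ordinary derivatives of x(t), y(t)
    return dz_dx * dx_dt + dz_dy * dy_dt      # the total-derivative formula above

t, h = 0.7, 1e-6
numerical = (z_of_t(t + h) - z_of_t(t - h)) / (2 * h)
print(total_derivative(t), numerical)         # the two values agree closely
```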

4. The Chain Rule and Backpropagation

Backpropagation (short for "backward propagation of errors") is the algorithm used to train neural networks. It is nothing more than the repeated application of the multivariate Chain Rule.

The Neural Network Chain

A neural network is a sequence of composite functions:

$$\text{Loss} \leftarrow \text{Output Layer} \leftarrow \text{Hidden Layer 2} \leftarrow \text{Hidden Layer 1} \leftarrow \text{Input}$$

The goal is to calculate how a small change in a parameter (a weight $w$) in an early layer affects the final Loss $J$.

$$\frac{\partial J}{\partial w_{\text{early}}} = \left(\frac{\partial J}{\partial \text{Output}}\right) \cdot \left(\frac{\partial \text{Output}}{\partial \text{Layer 2}}\right) \cdot \left(\frac{\partial \text{Layer 2}}{\partial \text{Layer 1}}\right) \cdot \left(\frac{\partial \text{Layer 1}}{\partial w_{\text{early}}}\right)$$

Backpropagation Flow
  1. Forward Pass: Calculate the prediction and the Loss $J$.
  2. Backward Pass (Backprop): Start at the end of the chain (the Loss $J$) and calculate the partial derivatives (gradients) layer by layer, multiplying them backward toward the input.
  3. Update: Use the final calculated gradient $\frac{\partial J}{\partial w}$ to update the weight $w$ via Gradient Descent (see the sketch after this list).
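
The following is a minimal scalar sketch of all three steps. The network shape, the sigmoid activation, the squared-error loss, the initial weights, and the learning rate are all assumptions chosen for illustration, not a prescribed architecture:

```python
import math

# Tiny scalar network:  h = sigmoid(w1 * x),  y_hat = w2 * h,  loss J = (y_hat - y)^2
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0        # one training example (assumed values)
w1, w2 = 0.5, -0.3     # assumed initial weights
lr = 0.1               # assumed learning rate

# 1. Forward pass: prediction and loss
z1 = w1 * x
h = sigmoid(z1)
y_hat = w2 * h
J = (y_hat - y) ** 2

# 2. Backward pass: apply the chain rule from the loss back to each weight
dJ_dyhat = 2 * (y_hat - y)        # dJ/dy_hat
dJ_dw2 = dJ_dyhat * h             # dJ/dw2 = dJ/dy_hat * dy_hat/dw2
dJ_dh = dJ_dyhat * w2             # propagate through the output layer
dh_dz1 = h * (1 - h)              # derivative of sigmoid at z1
dJ_dw1 = dJ_dh * dh_dz1 * x       # dJ/dw1 via the full chain

# 3. Update: one gradient-descent step
w1 -= lr * dJ_dw1
w2 -= lr * dJ_dw2
print(J, dJ_dw1, dJ_dw2)
```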

5. Summary of Calculus for ML

You have now covered the three foundational concepts of Calculus required for Machine Learning:

| Concept | Mathematical Tool | ML Application |
| --- | --- | --- |
| Derivatives | $\frac{df}{dx}$ | Measures the slope of the loss function. |
| Partial Derivatives | $\nabla J$ (the Gradient) | Identifies the direction of steepest ascent in the high-dimensional loss surface. |
| Chain Rule | $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ | Propagates the gradient backward through all layers of a neural network to calculate parameter updates. |

With the mathematical foundations of Linear Algebra and Calculus established, we are now ready to tackle the core optimization algorithm that brings these concepts together: Gradient Descent.