
The Jacobian Matrix

While the Gradient is used for functions that take a vector and return a single scalar (like a Loss Function), the Jacobian is the generalization of the derivative for functions that take a vector and return another vector.

In Deep Learning, almost every layer in a neural network is a vector-valued function. To pass gradients backward through these layers, we use the Jacobian.

1. What is the Jacobian?

The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function.

Suppose we have a function $\mathbf{f}$ that maps an input vector $\mathbf{x}$ of size $n$ to an output vector $\mathbf{y}$ of size $m$:

$$\mathbf{y} = \mathbf{f}(\mathbf{x}) \quad \text{where} \quad \mathbf{x} \in \mathbb{R}^n, \; \mathbf{y} \in \mathbb{R}^m$$

The Jacobian matrix $\mathbf{J}$ is an $m \times n$ matrix where each entry $(i, j)$ represents how much the $i$-th output changes with respect to the $j$-th input.

$$\mathbf{J} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
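
As a quick sanity check on the shape, here is a minimal sketch using PyTorch's `torch.autograd.functional.jacobian` on a toy function (the function itself is just an illustrative assumption, not from the text):

```python
import torch

# Toy vector-valued function f: R^3 -> R^2
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])

# The Jacobian of an R^n -> R^m function has shape (m, n)
J = torch.autograd.functional.jacobian(f, x)
print(J.shape)  # torch.Size([2, 3])
```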

2. Why does the Jacobian matter in ML?

The Jacobian is the mathematical bridge that allows the Chain Rule to work across entire layers of a neural network.

A. Backpropagation across Layers

Imagine a layer in a network that takes an input vector $\mathbf{h}_{in}$ and produces an output vector $\mathbf{h}_{out}$. During backpropagation, we receive the gradient of the loss $L$ with respect to the output: $\frac{\partial L}{\partial \mathbf{h}_{out}}$.

To continue the "chain" and find the gradient with respect to the input, we must multiply by the Jacobian of that layer:

$$\frac{\partial L}{\partial \mathbf{h}_{in}} = \frac{\partial L}{\partial \mathbf{h}_{out}} \cdot \mathbf{J}$$
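
As a concrete sketch (assuming a hypothetical linear layer $\mathbf{h}_{out} = W\mathbf{h}_{in}$, whose Jacobian with respect to the input is simply $W$), the backward step is just a vector-matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear layer: h_out = W @ h_in, so its Jacobian w.r.t. h_in is W
W = rng.normal(size=(4, 3))    # maps R^3 -> R^4
h_in = rng.normal(size=3)
h_out = W @ h_in

# Upstream gradient dL/dh_out arriving from the layers above
grad_out = rng.normal(size=4)

# Chain rule across the layer: dL/dh_in = dL/dh_out · J
grad_in = grad_out @ W
print(grad_in.shape)           # (3,)
```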

B. Activation Functions

When you apply an activation function like Sigmoid or ReLU to a vector, you are essentially creating a vector-to-vector mapping. The derivative of this mapping is a Jacobian matrix. For element-wise activations, this Jacobian is a diagonal matrix, which makes computation very efficient.
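
For instance, here is a small sketch (using Sigmoid, implemented by hand in NumPy for illustration) showing that the element-wise Jacobian is diagonal, so multiplying by it collapses to an element-wise product:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.0, 2.0])
s = sigmoid(z)

# Jacobian of element-wise sigmoid: diagonal, with sigma'(z) = s * (1 - s) on the diagonal
J = np.diag(s * (1.0 - s))

# Multiplying an upstream gradient by a diagonal Jacobian is just an element-wise product
grad_out = np.array([0.1, -0.3, 0.5])
print(np.allclose(grad_out @ J, grad_out * s * (1.0 - s)))  # True
```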

3. Example Calculation

Let's say we have a function $\mathbf{f}(x_1, x_2)$ that outputs a 2D vector:

  1. $y_1 = x_1^2 + x_2$
  2. $y_2 = 5x_1 + 2x_2^3$

To find the Jacobian $\mathbf{J}$:

  • Row 1 (Derivatives of $y_1$):
    • $\frac{\partial y_1}{\partial x_1} = 2x_1$
    • $\frac{\partial y_1}{\partial x_2} = 1$
  • Row 2 (Derivatives of $y_2$):
    • $\frac{\partial y_2}{\partial x_1} = 5$
    • $\frac{\partial y_2}{\partial x_2} = 6x_2^2$

The resulting Jacobian matrix is:

$$\mathbf{J} = \begin{bmatrix} 2x_1 & 1 \\ 5 & 6x_2^2 \end{bmatrix}$$
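
As a check, the same Jacobian can be computed with automatic differentiation. Here is a minimal sketch with PyTorch's `torch.autograd.functional.jacobian`, evaluated at the (arbitrarily chosen) point $x_1 = 3$, $x_2 = 2$:

```python
import torch

# The example function: f(x) = [x1^2 + x2, 5*x1 + 2*x2^3]
def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([x1 ** 2 + x2, 5 * x1 + 2 * x2 ** 3])

x = torch.tensor([3.0, 2.0])
J = torch.autograd.functional.jacobian(f, x)
print(J)
# tensor([[ 6.,  1.],
#         [ 5., 24.]])  i.e. [[2*x1, 1], [5, 6*x2^2]] at x1 = 3, x2 = 2
```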

4. Scaling the Chain Rule

In modern frameworks like PyTorch or TensorFlow, we rarely compute the full Jacobian matrix explicitly because it can be massive (e.g., 1 million × 1 million for a large layer).

Instead, these frameworks perform Vector-Jacobian Products (VJPs). They directly calculate the result of $\mathbf{v}^T \mathbf{J}$ (where $\mathbf{v}$ is the incoming gradient), which is much faster and uses less memory.
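
In PyTorch, a VJP corresponds to calling `torch.autograd.grad` with the incoming gradient passed as `grad_outputs`. A minimal sketch (the sizes and the linear map are illustrative assumptions):

```python
import torch

x = torch.randn(1000, requires_grad=True)
W = torch.randn(1000, 1000)

y = W @ x              # vector-valued output; its full Jacobian would be 1000 x 1000
v = torch.randn(1000)  # incoming gradient dL/dy

# Vector-Jacobian product: computes v^T J directly, without materializing J
vjp, = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp.shape)       # torch.Size([1000])
```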


The Jacobian handles first-order changes. But to understand the "curvature" of our loss surface (whether we are in a narrow valley or a wide bowl), we need to look at second-order derivatives: The Hessian.