The Jacobian Matrix
While the Gradient is used for functions that take a vector and return a single scalar (like a Loss Function), the Jacobian generalizes the derivative to functions that take a vector and return another vector.
In Deep Learning, almost every layer in a neural network is a vector-valued function. To pass gradients backward through these layers, we use the Jacobian.
1. What is the Jacobian?
The Jacobian is a matrix of all first-order partial derivatives of a vector-valued function.
Suppose we have a function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ that maps an input vector of size $n$ to an output vector of size $m$:

$$
\mathbf{y} = \mathbf{f}(\mathbf{x}), \qquad \mathbf{x} \in \mathbb{R}^n, \; \mathbf{y} \in \mathbb{R}^m
$$

The Jacobian matrix $J$ is an $m \times n$ matrix where each entry $J_{ij} = \frac{\partial f_i}{\partial x_j}$ represents how much the $i$-th output changes with respect to the $j$-th input:

$$
J = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$
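To make the definition concrete, here is a minimal sketch (the function `f` below is a made-up example, not one from this article) that estimates each entry $J_{ij}$ with finite differences: nudge the $j$-th input slightly and measure how every output moves.

```python
import numpy as np

# Numerically estimate J[i, j] = d f_i / d x_j for a toy function f: R^3 -> R^2.
# (The function below is a hypothetical example, purely to illustrate the definition.)
def f(x):
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def numerical_jacobian(f, x, eps=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))        # m x n
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps                  # nudge the j-th input a little...
        J[:, j] = (f(x_step) - y) / eps   # ...and record how every output moved
    return J

x = np.array([1.0, 2.0, 3.0])
print(numerical_jacobian(f, x).round(3))
# approximately:
# [[ 2.    1.    0.  ]
#  [ 0.    1.   -0.99]]
```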
2. Why does the Jacobian matter in ML?
The Jacobian is the mathematical bridge that allows the Chain Rule to work across entire layers of a neural network.
A. Backpropagation across Layers
Imagine a layer in a network that takes an input vector $\mathbf{x}$ and produces an output vector $\mathbf{y}$. During backpropagation, we receive the gradient of the loss with respect to the output: $\frac{\partial L}{\partial \mathbf{y}}$.

To continue the "chain" and find the gradient with respect to the input, we must multiply by the Jacobian of that layer:

$$
\frac{\partial L}{\partial \mathbf{x}} = J^\top \frac{\partial L}{\partial \mathbf{y}}, \qquad \text{where } J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}
$$
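To make the multiplication concrete, here is a minimal NumPy sketch with a made-up linear layer $\mathbf{y} = W\mathbf{x}$, chosen because its Jacobian is simply the weight matrix $W$:

```python
import numpy as np

# Hypothetical linear layer y = W @ x, so its Jacobian dy/dx is simply W.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))    # maps R^3 -> R^4
x = rng.normal(size=3)
y = W @ x

grad_y = rng.normal(size=4)    # incoming gradient dL/dy from the layer above

# Chain rule across the layer: dL/dx = J^T @ dL/dy, with J = dy/dx = W here.
grad_x = W.T @ grad_y
print(grad_x.shape)            # (3,) -- one gradient entry per input component
```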
B. Activation Functions
When you apply an activation function like Sigmoid or ReLU to a vector, you are essentially creating a vector-to-vector mapping. The derivative of this mapping is a Jacobian matrix. For element-wise activations, this Jacobian is a diagonal matrix, which makes computation very efficient.
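For example, here is a small sketch (assuming a sigmoid activation) of why the element-wise Jacobian is diagonal: output $i$ depends only on input $i$, so every off-diagonal partial derivative is zero.

```python
import torch

def sigmoid_jacobian(x):
    """Jacobian of an element-wise sigmoid: a diagonal matrix with
    sigma'(x_i) = sigma(x_i) * (1 - sigma(x_i)) on the diagonal."""
    s = torch.sigmoid(x)
    return torch.diag(s * (1 - s))

x = torch.tensor([-1.0, 0.0, 2.0])
print(sigmoid_jacobian(x))
# Off-diagonal entries are zero because output i depends only on input i.
```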
3. Example Calculation
Let's say we have a function that takes a 2D input and outputs a 2D vector, for example:

$$
\mathbf{f}(x_1, x_2) =
\begin{bmatrix}
x_1^2 x_2 \\
5x_1 + \sin x_2
\end{bmatrix}
$$

To find the Jacobian $J$, we compute the partial derivative of each output with respect to each input:

- Row 1 (derivatives of $f_1 = x_1^2 x_2$): $\frac{\partial f_1}{\partial x_1} = 2x_1 x_2$ and $\frac{\partial f_1}{\partial x_2} = x_1^2$
- Row 2 (derivatives of $f_2 = 5x_1 + \sin x_2$): $\frac{\partial f_2}{\partial x_1} = 5$ and $\frac{\partial f_2}{\partial x_2} = \cos x_2$

The resulting Jacobian matrix is:

$$
J =
\begin{bmatrix}
2x_1 x_2 & x_1^2 \\
5 & \cos x_2
\end{bmatrix}
$$
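As a sanity check, a Jacobian of this form can be reproduced with autograd. The following is a minimal sketch using `torch.autograd.functional.jacobian` on the same function, evaluated at the (arbitrarily chosen) point $(x_1, x_2) = (2, 0)$:

```python
import torch

# The worked example above: f(x1, x2) = [x1^2 * x2, 5*x1 + sin(x2)]
def f(x):
    x1, x2 = x[0], x[1]
    return torch.stack([x1**2 * x2, 5 * x1 + torch.sin(x2)])

x = torch.tensor([2.0, 0.0])
J = torch.autograd.functional.jacobian(f, x)
print(J)
# tensor([[0., 4.],
#         [5., 1.]])
# This matches the analytic Jacobian [[2*x1*x2, x1^2], [5, cos(x2)]] at (2, 0).
```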
4. Scaling the Chain Rule
In modern frameworks like PyTorch or TensorFlow, we rarely compute the full Jacobian matrix explicitly because it can be massive (for a single large layer, the full Jacobian can hold millions of entries per sample).
Instead, these frameworks perform Vector-Jacobian Products (VJP). They directly calculate the result of $\mathbf{v}^\top J$ (where $\mathbf{v}$ is the incoming gradient $\frac{\partial L}{\partial \mathbf{y}}$), which is much faster and uses far less memory.
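In PyTorch, this is what happens when you pass `grad_outputs` to `torch.autograd.grad` (or call `.backward()` on a non-scalar output). A minimal sketch with a made-up layer:

```python
import torch

# Hypothetical layer: y = tanh(W @ x). We never materialize dy/dx.
W = torch.randn(4, 3)
x = torch.randn(3, requires_grad=True)
y = torch.tanh(W @ x)

grad_y = torch.randn(4)   # stand-in for the incoming gradient dL/dy

# Vector-Jacobian product: grad_y^T @ (dy/dx), computed without building the Jacobian.
(grad_x,) = torch.autograd.grad(y, x, grad_outputs=grad_y)
print(grad_x.shape)       # torch.Size([3])
```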
The Jacobian handles first-order changes. But to understand the "curvature" of our loss surface (whether we are in a narrow valley or a wide bowl), we need to look at second-order derivatives: the Hessian.