The Hessian Matrix
While the Gradient tells us the direction of the steepest slope, it doesn't tell us about the "shape" of the ground. Is the slope getting steeper or flatter? Are we in a narrow canyon or a wide, shallow bowl? To answer these questions, we need second-order derivatives, organized into the Hessian Matrix.
1. What is the Hessian?
The Hessian is a square matrix of second-order partial derivatives of a scalar-valued function. It describes the local curvature of the function.
If we have a function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is an $n \times n$ matrix:
$$H_f = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \qquad (H_f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
If the second derivatives are continuous, the Hessian is a symmetric matrix (i.e., $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$, so $H = H^T$). This makes it easier to work with using Linear Algebra tools like eigendecomposition.
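To make the entry-by-entry definition and the symmetry concrete, here is a minimal numerical sketch (my own illustration using NumPy and central finite differences; the test function and step size are arbitrary choices, not from the text):

```python
import numpy as np

def hessian_fd(f, x, eps=1e-5):
    """Approximate the Hessian of a scalar-valued f at point x with
    central finite differences (fine for a demo, not for production)."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # Central-difference estimate of d^2 f / (dx_i dx_j)
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x, y) = x^2 * y + y^3: the Hessian depends on where we evaluate it.
f = lambda v: v[0]**2 * v[1] + v[1]**3
print(hessian_fd(f, np.array([1.0, 2.0])))  # ~ [[4, 2], [2, 12]] -- symmetric
```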
2. Why does the Hessian matter in ML?
The Hessian helps us understand the "topography" of the Loss Function $L(\theta)$.
A. Determining Maxima and Minima
A zero gradient ($\nabla f = 0$) only tells us we are at a critical point; it could be a peak, a valley, or a saddle point. The Hessian tells us which one (an eigenvalue check is sketched after this list):
- Positive Definite Hessian (all eigenvalues positive): The surface curves upward in all directions (a Local Minimum).
- Negative Definite Hessian (all eigenvalues negative): The surface curves downward in all directions (a Local Maximum).
- Indefinite Hessian (eigenvalues of mixed sign): The surface curves up in some directions and down in others (a Saddle Point).
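In practice the definiteness check is usually read off the eigenvalues of the symmetric Hessian. A hedged sketch (the tolerance and the example matrices below are illustrative choices, not from the text):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # eigvalsh is intended for symmetric matrices
    if np.all(eigvals > tol):
        return "local minimum (positive definite)"
    if np.all(eigvals < -tol):
        return "local maximum (negative definite)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point (indefinite)"
    return "inconclusive (some eigenvalues are ~0)"

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 3.0]])))    # local minimum
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -3.0]])))  # local maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -3.0]])))   # saddle point
```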
B. Curvature and Learning Rates
The Hessian determines the "width" of the valley (a small gradient-descent sketch follows this list):
- High Curvature: A narrow, steep valley. If the learning rate is too high, Gradient Descent will bounce back and forth across the valley walls.
- Low Curvature: A wide, flat valley. Gradient Descent will move very slowly toward the bottom.
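A tiny 1-D sketch of both failure modes (the curvatures, learning rate, and starting point are arbitrary illustrative values): on $f(x) = \tfrac{1}{2} c x^2$ the second derivative is simply $c$, and the update $x \leftarrow x - \eta\, c\, x$ either bounces across the valley (large $c$) or crawls toward the bottom (small $c$).

```python
def gradient_descent_1d(curvature, lr, steps=8, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x^2, so f'(x) = curvature * x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * curvature * xs[-1])
    return [round(x, 3) for x in xs]

# High curvature: the step overshoots the minimum and oscillates across it.
print(gradient_descent_1d(curvature=50.0, lr=0.039))
# Low curvature, same learning rate: progress toward 0 is painfully slow.
print(gradient_descent_1d(curvature=0.5, lr=0.039))
```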
3. Second-Order Optimization
Standard Gradient Descent is a first-order method; it only uses the gradient. There are second-order methods, like Newton's Method, that use the Hessian to take much more efficient steps.
Instead of just moving in the negative gradient direction, Newton's method scales the step by the inverse of the Hessian:
$$\theta_{t+1} = \theta_t - H^{-1} \nabla L(\theta_t)$$
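A hedged sketch of a single Newton step (assuming the problem is small enough to form the full Hessian; the quadratic loss below is an illustrative choice, and solving the linear system $Hp = \nabla L$ avoids computing $H^{-1}$ explicitly):

```python
import numpy as np

def newton_step(grad_fn, hess_fn, theta):
    """One Newton update: theta <- theta - H^{-1} grad, done via a linear solve."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    return theta - np.linalg.solve(H, g)

# Illustrative convex quadratic: L(x, y) = 3x^2 + 2xy + y^2.
grad_fn = lambda t: np.array([6*t[0] + 2*t[1], 2*t[0] + 2*t[1]])
hess_fn = lambda t: np.array([[6.0, 2.0], [2.0, 2.0]])

theta = np.array([5.0, -3.0])
print(newton_step(grad_fn, hess_fn, theta))  # lands on the minimum [0, 0] in one step
```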
In modern Deep Learning, the Hessian is rarely used directly. If a model has 10 million parameters, the Hessian matrix would have $10^7 \times 10^7 = 10^{14}$ elements (100 trillion!), which is impossible to store in memory or invert. We use "quasi-Newton" methods or adaptive optimizers (like Adam) that approximate this curvature information.
4. Example Calculation
Let $f(x, y) = 3x^2 + 2xy + y^2$.
- First Partial Derivatives (Gradient):
$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[4pt] \dfrac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 6x + 2y \\ 2x + 2y \end{bmatrix}$$
- Second Partial Derivatives (Hessian):
$$\frac{\partial^2 f}{\partial x^2} = 6, \qquad \frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x} = 2, \qquad \frac{\partial^2 f}{\partial y^2} = 2$$
The Hessian matrix is:
$$H = \begin{bmatrix} 6 & 2 \\ 2 & 2 \end{bmatrix}$$
It is symmetric, and since $\det H = 8 > 0$ and the trace is positive, both eigenvalues are positive: the Hessian is positive definite, so the critical point at $(0, 0)$ is a local (in fact global) minimum.
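The same result can be checked symbolically; a minimal sketch with SymPy (assuming it is available; any computer algebra system would do):

```python
import sympy as sp

x, y = sp.symbols("x y")
f = 3*x**2 + 2*x*y + y**2

gradient = [sp.diff(f, v) for v in (x, y)]   # [6*x + 2*y, 2*x + 2*y]
H = sp.hessian(f, (x, y))                    # Matrix([[6, 2], [2, 2]])

print(gradient)
print(H)
print(H.is_positive_definite)                # True -> the critical point is a minimum
```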
Now that we have covered the mathematics of change (Calculus), we need to look at the mathematics of uncertainty. This allows us to handle noisy data and make predictions with confidence.