Partial Derivatives
In the real world, and especially in Machine Learning, the functions we deal with rarely depend on just one variable. Our Loss Function depends on hundreds, even millions, of parameters (weights and biases). To navigate this high-dimensional space, we need the concept of a Partial Derivative.
1. Multi-Variable Functions in ML
Consider the simplest linear model. The predicted output $\hat{y}$ is a function of the input feature $x$, the weight $w$, and the bias $b$:

$$\hat{y} = wx + b$$

The Loss Function $J$ (e.g., Mean Squared Error) depends on the parameters $w$ and $b$:

$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \big( (w x_i + b) - y_i \big)^2$$

The goal is to adjust both $w$ and $b$ simultaneously to minimize $J$. This requires us to know the rate of change of $J$ with respect to each parameter individually.
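To make this concrete, here is a minimal Python/NumPy sketch of a one-feature linear model and its MSE loss, evaluated at two different parameter settings. The toy data and the helper names `predict` and `mse_loss` are illustrative assumptions, not part of any library.

```python
import numpy as np

# Illustrative toy data generated by the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def predict(x, w, b):
    """Linear model: y_hat = w*x + b."""
    return w * x + b

def mse_loss(w, b):
    """Mean Squared Error J(w, b) for the toy data above."""
    return np.mean((predict(x, w, b) - y) ** 2)

# J is a surface over (w, b): different parameter choices give different losses.
print(mse_loss(0.0, 0.0))  # far from the data -> large loss (41.0)
print(mse_loss(2.0, 1.0))  # the true parameters -> zero loss
```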
2. What is a Partial Derivative?
A partial derivative of a multi-variable function is simply the derivative of that function with respect to one variable, while treating all other variables as constants.
Notation
The partial derivative of a function $f(x, y)$ with respect to $x$ is denoted with the curly $\partial$ symbol:

$$\frac{\partial f}{\partial x}$$
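The definition translates directly into a numerical check: hold every other variable fixed and nudge only the variable of interest. The sketch below is a simple central-difference approximation; the step size `h` and the example function are arbitrary illustrative choices.

```python
def partial_x(f, x, y, h=1e-6):
    """Approximate df/dx at (x, y): vary x only, hold y fixed."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    """Approximate df/dy at (x, y): vary y only, hold x fixed."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# Example: for f(x, y) = x * y, df/dx = y and df/dy = x.
f = lambda x, y: x * y
print(partial_x(f, 2.0, 3.0))  # ~3.0
print(partial_y(f, 2.0, 3.0))  # ~2.0
```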
3. Calculating the Partial Derivative
The calculation uses all the standard derivative rules, but with the assumption that everything not being differentiated is a constant.
Example
Let the function be $f(x, y) = x^2 + 3xy + 5y^2$.
A. Partial Derivative with respect to $x$:
We treat $y$ as a constant.
- $\dfrac{\partial}{\partial x}\left(x^2\right) = 2x$ (Power Rule)
- $\dfrac{\partial}{\partial x}\left(3xy\right) = 3y$ (Treat $3y$ as the constant coefficient of $x$)
- $\dfrac{\partial}{\partial x}\left(5y^2\right) = 0$ (Treat as a constant)

So $\dfrac{\partial f}{\partial x} = 2x + 3y$.

B. Partial Derivative with respect to $y$:
We treat $x$ as a constant.
- $\dfrac{\partial}{\partial y}\left(x^2\right) = 0$ (Treat as a constant)
- $\dfrac{\partial}{\partial y}\left(3xy\right) = 3x$ (Treat $3x$ as the constant coefficient of $y$)
- $\dfrac{\partial}{\partial y}\left(5y^2\right) = 10y$ (Constant Multiple Rule)

So $\dfrac{\partial f}{\partial y} = 3x + 10y$.
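As a quick sanity check, the analytic partials above can be compared against central-difference approximations. The sketch below assumes the example function $f(x, y) = x^2 + 3xy + 5y^2$ and an arbitrary test point.

```python
def f(x, y):
    return x**2 + 3*x*y + 5*y**2

def df_dx(x, y):
    return 2*x + 3*y       # analytic partial with respect to x

def df_dy(x, y):
    return 3*x + 10*y      # analytic partial with respect to y

# Central-difference check at the arbitrary point (1.0, 2.0).
h = 1e-6
x0, y0 = 1.0, 2.0
print(df_dx(x0, y0), (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h))  # both ~8.0
print(df_dy(x0, y0), (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h))  # both ~23.0
```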
4. The Gradient Vector
When we collect all the partial derivatives of a multi-variable function into a single vector, we get the Gradient of that function.
The gradient of the loss, denoted $\nabla J$ (read as "nabla J" or "del J"), is the vector of all partial derivatives:

$$\nabla J = \begin{bmatrix} \dfrac{\partial J}{\partial w} \\[6pt] \dfrac{\partial J}{\partial b} \end{bmatrix}$$
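In code, the gradient is simply an array that stacks the partials in a fixed order. The sketch below assumes the toy MSE setup from Section 1; the partials $\partial J/\partial w$ and $\partial J/\partial b$ are the standard closed-form derivatives of the squared-error loss.

```python
import numpy as np

# Same illustrative toy data as before (y = 2x + 1).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def gradient(w, b):
    """Return the gradient vector [dJ/dw, dJ/db] of the MSE loss J(w, b)."""
    error = (w * x + b) - y            # residuals y_hat - y
    dJ_dw = 2 * np.mean(error * x)     # partial of J with respect to w
    dJ_db = 2 * np.mean(error)         # partial of J with respect to b
    return np.array([dJ_dw, dJ_db])    # nabla J

print(gradient(0.0, 0.0))  # [-35., -12.]: steepest ascent points away from the minimum at (2, 1)
```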
Significance in ML
- Direction of Steepest Ascent: The gradient vector $\nabla J$ points in the direction of the steepest increase (the "uphill" direction) of the Loss Function $J$.
- Gradient Descent: Since we want to minimize the loss, the Gradient Descent update rule is to move the parameter vector $\theta$ in the direction opposite to the gradient:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \, \nabla J(\theta_{\text{old}})$$

where $\alpha$ is the learning rate.
This single vector operation updates all weights and biases simultaneously, making the gradient the most fundamental mathematical quantity in training neural networks.
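Putting the pieces together, here is a hedged sketch of the full gradient descent loop on the toy linear-regression loss; the learning rate, iteration count, and initialization are illustrative choices, not prescriptions.

```python
import numpy as np

# Same illustrative toy data (y = 2x + 1).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def grad_J(theta):
    """Gradient of the MSE loss with respect to the parameter vector [w, b]."""
    w, b = theta
    error = (w * x + b) - y
    return np.array([2 * np.mean(error * x), 2 * np.mean(error)])

theta = np.array([0.0, 0.0])  # parameter vector [w, b]
alpha = 0.05                  # learning rate

for _ in range(2000):
    theta = theta - alpha * grad_J(theta)  # step opposite to the gradient

print(theta)  # converges to approximately [2.0, 1.0]
```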
Now that we can find the direction of steepest ascent (the Gradient), we must ensure that the update rule accurately propagates this signal through the entire multi-layered network, which is the role of the Chain Rule.