
Partial Derivatives

In the real world, and especially in Machine Learning, the functions we deal with rarely depend on just one variable. Our Loss Function depends on hundreds, even millions, of parameters (weights and biases). To navigate this high-dimensional space, we need the concept of a Partial Derivative.

1. Multi-Variable Functions in ML

Consider the simplest linear model. The predicted output $\hat{y}$ is a function of the input feature $x$, the weight $w$, and the bias $b$:

$$\hat{y}(w, b) = wx + b$$

The Loss Function $J$ (e.g., Mean Squared Error) depends on the parameters $w$ and $b$:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

The goal is to adjust both $w$ and $b$ simultaneously to minimize $J$. This requires us to know the rate of change of $J$ with respect to each parameter individually.
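Before worrying about derivatives, it can help to see the loss as an ordinary function of two numbers. Below is a minimal sketch of that idea, assuming NumPy is available; the toy data, the function name `mse_loss`, and the parameter values are made up for illustration only.

```python
import numpy as np

def mse_loss(w, b, x, y):
    """Mean squared error J(w, b) = 1/(2m) * sum((y_hat - y)^2)."""
    y_hat = w * x + b          # predictions of the linear model
    m = len(x)                 # number of training examples
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Toy data generated by y = 2x + 1, so the loss at (w=2, b=1) is exactly 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])
print(mse_loss(2.0, 1.0, x, y))  # 0.0
print(mse_loss(1.5, 0.0, x, y))  # positive: these parameters fit the data worse
```

Changing $w$ while holding $b$ fixed (or vice versa) changes the loss, and the partial derivative measures exactly that rate of change.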

2. What is a Partial Derivative?

A partial derivative of a multi-variable function is simply the derivative of that function with respect to one variable, while treating all other variables as constants.

Notation

The partial derivative of a function $f(x, y)$ with respect to $x$ is denoted by the curly $\partial$ symbol:

$$\frac{\partial f}{\partial x} \quad \text{or} \quad f_x$$

3. Calculating the Partial Derivative

The calculation uses all the standard derivative rules, treating every variable other than the one being differentiated as a constant.

Example

Let the function be $f(x, y) = 3x^2 + 5xy^3 + 2y$.

A. Partial Derivative with respect to $x$: $\frac{\partial f}{\partial x}$

We treat $y$ as a constant.

$$\frac{\partial f}{\partial x} = \frac{\partial}{\partial x}(3x^2) + \frac{\partial}{\partial x}(5y^3 \cdot x) + \frac{\partial}{\partial x}(2y)$$
  1. $\frac{\partial}{\partial x}(3x^2) = 6x$ (Power Rule)
  2. $\frac{\partial}{\partial x}(5y^3 \cdot x) = 5y^3$ (Treat $5y^3$ as the constant coefficient of $x$)
  3. $\frac{\partial}{\partial x}(2y) = 0$ (Treat $2y$ as a constant)

$$\frac{\partial f}{\partial x} = 6x + 5y^3$$

B. Partial Derivative with respect to $y$: $\frac{\partial f}{\partial y}$

We treat $x$ as a constant.

$$\frac{\partial f}{\partial y} = \frac{\partial}{\partial y}(3x^2) + \frac{\partial}{\partial y}(5x \cdot y^3) + \frac{\partial}{\partial y}(2y)$$
  1. $\frac{\partial}{\partial y}(3x^2) = 0$ (Treat $3x^2$ as a constant)
  2. $\frac{\partial}{\partial y}(5x \cdot y^3) = 5x \cdot (3y^2) = 15xy^2$ (Treat $5x$ as the constant coefficient of $y^3$)
  3. $\frac{\partial}{\partial y}(2y) = 2$ (Constant Multiple Rule)

$$\frac{\partial f}{\partial y} = 15xy^2 + 2$$
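If you want to double-check the hand calculation, a symbolic algebra library can differentiate the same expression. This is a small sketch assuming SymPy is installed; it is not part of the worked example above, just a verification aid.

```python
from sympy import symbols, diff

x, y = symbols('x y')
f = 3*x**2 + 5*x*y**3 + 2*y

# Differentiate with respect to one variable while the other is held constant.
print(diff(f, x))  # 6*x + 5*y**3   (matches the hand calculation)
print(diff(f, y))  # 15*x*y**2 + 2  (matches the hand calculation)
```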

4. The Gradient Vector

When we collect all the partial derivatives of a multi-variable function $J(\theta_1, \theta_2, \ldots, \theta_n)$ into a single vector, we get the Gradient of $J$.

The gradient, denoted $\nabla J$ (read as "nabla J" or "del J"), is the vector of all partial derivatives:

$$\nabla J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix}$$
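In code, the gradient is just an array holding one partial derivative per parameter. The sketch below approximates it numerically with central differences (an assumption for illustration; NumPy and the helper name `numerical_gradient` are not from the original text) and checks it against the analytic gradient of the example function $f(x, y)$ from Section 3.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# f(x, y) = 3x^2 + 5xy^3 + 2y, written as a function of a parameter vector
f = lambda t: 3*t[0]**2 + 5*t[0]*t[1]**3 + 2*t[1]

print(numerical_gradient(f, np.array([1.0, 2.0])))
# Analytic gradient at (1, 2): [6x + 5y^3, 15xy^2 + 2] = [46, 62]
```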

Significance in ML

  1. Direction of Steepest Ascent: The gradient vector $\nabla J$ points in the direction of the steepest increase (the "uphill" direction) of the Loss Function $J$.
  2. Gradient Descent: Since we want to minimize the loss, the Gradient Descent update rule moves the parameter vector $\boldsymbol{\theta}$ in the direction opposite to the gradient, scaled by the learning rate $\alpha$:
$$\boldsymbol{\theta}_{\text{new}} = \boldsymbol{\theta}_{\text{old}} - \alpha \nabla J(\boldsymbol{\theta})$$

This single vector operation updates all weights and biases simultaneously, making the gradient the most fundamental mathematical quantity in training neural networks.
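To make the update rule concrete, here is a minimal gradient descent sketch for the one-feature linear model from Section 1, assuming NumPy. The analytic partials $\frac{\partial J}{\partial w} = \frac{1}{m}\sum(\hat{y}^{(i)} - y^{(i)})x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum(\hat{y}^{(i)} - y^{(i)})$ follow from the MSE loss defined above; the toy data, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Toy data generated by y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w, b = 0.0, 0.0          # initial parameters
alpha = 0.05             # learning rate (illustrative choice)

for _ in range(2000):
    y_hat = w * x + b
    m = len(x)
    # Partial derivatives of J(w, b) = 1/(2m) * sum((y_hat - y)^2)
    dJ_dw = np.sum((y_hat - y) * x) / m
    dJ_db = np.sum(y_hat - y) / m
    # Move opposite to the gradient: theta_new = theta_old - alpha * grad
    w -= alpha * dJ_dw
    b -= alpha * dJ_db

print(w, b)  # approaches (2.0, 1.0), the parameters that generated the data
```

Both parameters are updated in the same step from the same gradient vector, which is exactly what the vector form of the update rule expresses.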


Now that we can find the direction of steepest ascent (the Gradient), we need to propagate this signal accurately through every layer of a multi-layered network, which is the role of the Chain Rule.