Derivatives - The Rate of Change

Calculus is the mathematical foundation for optimization in Machine Learning. Specifically, Derivatives are the primary tool used to train almost every ML model, from Linear Regression to complex Neural Networks, via algorithms like Gradient Descent.

1. What is a Derivative?

The derivative of a function measures the instantaneous rate of change of that function. Geometrically, the derivative at any point on a curve is the slope of the tangent line to the curve at that point.

Formal Definition

The derivative of a function $f(x)$ with respect to $x$ is defined using limits:

$$f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

  • $\frac{dy}{dx}$ is the common notation, read as "the derivative of $y$ with respect to $x$."
  • The expression $\frac{f(x+h) - f(x)}{h}$ is the slope of the secant line between $x$ and $x+h$.
  • Taking the limit as $h$ approaches zero gives the exact slope of the tangent line at $x$ (approximated numerically in the sketch below).
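
Before $h$ actually reaches zero, the secant slope already gives a useful approximation of the derivative. Here is a minimal numerical sketch; the test function `square` and the step size `h=1e-5` are arbitrary illustrative choices:

```python
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) by the slope of the secant line over a small step h."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x ** 2

# The true derivative of x**2 is 2x, so at x = 3 we expect a value close to 6.
print(numerical_derivative(square, 3.0))  # ~6.00001
```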

2. Derivatives in Machine Learning: Optimization

In Machine Learning, we define a Loss Function (or Cost Function) $J(\theta)$, which measures the error of our model, where $\theta$ represents the model's parameters (weights and biases).

The goal of training is to find the parameter values $\theta$ that minimize the loss function.
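
As a concrete sketch, with made-up data and a hypothetical one-parameter model $\hat{y} = \theta x$, a Mean Squared Error loss might look like this:

```python
import numpy as np

# Made-up training data: the targets follow y = 2x, so the best theta is 2.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def loss(theta):
    """Mean squared error of the one-parameter model y_hat = theta * x."""
    y_hat = theta * x
    return np.mean((y_hat - y) ** 2)

print(loss(1.0))  # about 4.67: theta = 1 underestimates every target
print(loss(2.0))  # 0.0: theta = 2 fits the toy data exactly
```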

A. Finding the Minimum

  1. A function's minimum (or maximum) occurs where the slope is zero.
  2. The derivative tells us the slope.
  3. Therefore, by setting the derivative $\frac{dJ}{d\theta}$ to zero, we can find the optimal parameters $\theta$ (see the SymPy sketch below).
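
For a loss simple enough to differentiate symbolically, this can be done directly. A minimal SymPy sketch, using the same illustrative quadratic loss that appears in the worked example of Section 3:

```python
import sympy as sp

theta = sp.symbols('theta')
J = theta**2 + 4*theta + 1           # illustrative quadratic loss

dJ = sp.diff(J, theta)               # derivative: 2*theta + 4
optimum = sp.solve(sp.Eq(dJ, 0), theta)
print(dJ, optimum)                   # 2*theta + 4 [-2]
```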

B. Gradient Descent

For most complex ML models, the loss function is too complex to solve by setting the derivative to zero directly. Instead, we use an iterative process called Gradient Descent.

The derivative $\frac{dJ}{d\theta}$ tells us two things:

  • Magnitude: How steep the slope is (how quickly the loss is changing).
  • Direction (Sign): Whether moving parameter $\theta$ in a positive direction will increase or decrease the loss.

In Gradient Descent, we update the parameter $\theta$ in the opposite direction of the derivative (down the slope) to find the minimum:

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ}{d\theta}$$

  • $\alpha$ (alpha) is the learning rate (a small scalar).
  • $\frac{dJ}{d\theta}$ is the derivative (the slope/gradient); a minimal implementation of this update loop is sketched below.
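
Here is a minimal sketch of the update loop in plain Python, assuming the illustrative quadratic loss $J(\theta) = \theta^2 + 4\theta + 1$ (the same one used in the sketch above and in Section 3), whose derivative is $2\theta + 4$:

```python
def dJ(theta):
    """Derivative of the illustrative loss J(theta) = theta**2 + 4*theta + 1."""
    return 2 * theta + 4

theta = 1.0    # arbitrary starting value
alpha = 0.1    # learning rate

for _ in range(50):
    theta = theta - alpha * dJ(theta)   # step against the slope

print(theta)  # approaches -2, where the derivative is zero
```

Because the slope at $\theta = 1$ is positive, each update decreases $\theta$; if the slope were negative, the same rule would increase it. That is exactly the direction information described above.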

3. Basic Differentiation Rules

You must be familiar with the following rules to understand how derivatives are calculated for model training.

| Rule Name | Function $f(x)$ | Derivative $\frac{d}{dx}f(x)$ | Example |
| --- | --- | --- | --- |
| Constant Rule | $c$ | $0$ | $\frac{d}{dx}(5) = 0$ |
| Power Rule | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}(x^3) = 3x^2$ |
| Constant Multiple | $c \cdot f(x)$ | $c \cdot f'(x)$ | $\frac{d}{dx}(4x^2) = 8x$ |
| Sum/Difference | $f(x) \pm g(x)$ | $f'(x) \pm g'(x)$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ |
| Exponential | $e^x$ | $e^x$ | |
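
Each of these rules can be sanity-checked symbolically; a quick sketch with SymPy, where the test expressions simply mirror the table's examples:

```python
import sympy as sp

x = sp.symbols('x')

print(sp.diff(5, x))            # 0         (Constant Rule)
print(sp.diff(x**3, x))         # 3*x**2    (Power Rule)
print(sp.diff(4*x**2, x))       # 8*x       (Constant Multiple)
print(sp.diff(x**2 - 3*x, x))   # 2*x - 3   (Sum/Difference)
print(sp.diff(sp.exp(x), x))    # exp(x)    (Exponential)
```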

Example: Quadratic Loss

Linear Regression often uses Mean Squared Error (MSE), which is a quadratic function of the weights $w$.

Let the simplified loss function be $J(w) = w^2 + 4w + 1$. We apply the Sum and Power Rules:

$$\frac{dJ}{dw} = \frac{d}{dw}(w^2) + \frac{d}{dw}(4w) + \frac{d}{dw}(1) = 2w + 4 + 0 = 2w + 4$$

If the current weight is $w = 1$, the slope is $2(1) + 4 = 6$ (steep, positive).
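
Plugging this slope into the Gradient Descent update from Section 2, with a hypothetical learning rate $\alpha = 0.1$, gives:

$$w_{\text{new}} = w_{\text{old}} - \alpha \frac{dJ}{dw} = 1 - 0.1 \cdot 6 = 0.4$$

Because the slope is positive, the update moves $w$ downhill; repeating it drives $w$ toward $-2$, where $\frac{dJ}{dw} = 2(-2) + 4 = 0$.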

References and Resources

To solidify your understanding of differentiation, here are some excellent resources:

  • Khan Academy - Differential Calculus: Comprehensive video tutorials covering limits, derivatives, and rules. Excellent for visual learners.
  • Calculus: Early Transcendentals by James Stewart (or any similar major textbook): Provides rigorous definitions and practice problems.
  • The Calculus of Computation by Lars Kristensen: A good resource that connects calculus directly to computational methods.

Most functions in ML depend on more than one parameter (e.g., $w_1, w_2, \text{bias}$). To find the slope in these multi-variable spaces, we must use Partial Derivatives.