Ridge Regression (L2 Regularization)
Ridge Regression is an extension of linear regression that adds a regularization term to the cost function. It is specifically designed to handle overfitting and issues caused by multicollinearity (when input features are highly correlated).
1. The Mathematical Objective
In standard OLS (Ordinary Least Squares), the model only cares about minimizing the error. In Ridge Regression, we add a "penalty" proportional to the square of the magnitude of the coefficients ($\beta_j$).
The cost function becomes:

$$J(\beta) = \text{MSE} + \alpha \sum_{j=1}^{n} \beta_j^2$$

- MSE (Mean Squared Error): The standard loss (prediction error).
- $\alpha$ (Alpha): The complexity parameter. It controls how much you want to penalize the size of the coefficients.
- $\sum_{j=1}^{n} \beta_j^2$: The (squared) L2 norm. Squaring the coefficients ensures they stay small but rarely hit exactly zero.
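To make the formula concrete, here is a minimal NumPy sketch (invented numbers, hypothetical coefficient vector) that computes the Ridge cost by hand:

import numpy as np
# Toy data: 4 samples, 2 features (values chosen purely for illustration)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])
beta = np.array([0.9, 1.1])   # hypothetical coefficients
alpha = 1.0
mse = np.mean((y - X @ beta) ** 2)        # standard prediction error
l2_penalty = alpha * np.sum(beta ** 2)    # alpha * sum of squared coefficients
ridge_cost = mse + l2_penalty
print(ridge_cost)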
2. Why use Ridge?
A. Preventing Overfitting
When a model has too many features or the features are highly correlated, the coefficients ($\beta_j$) can become very large. This makes the model extremely sensitive to small fluctuations in the training data. Ridge "shrinks" these coefficients, making the model more stable.
B. Handling Multicollinearity
If two variables are nearly identical (e.g., height in inches and height in centimeters), standard regression might assign one a massive positive weight and the other a massive negative weight. Ridge forces the weights to be distributed more evenly and kept small.
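As a rough illustration, the sketch below builds two nearly identical columns from synthetic data and compares the coefficients learned by plain LinearRegression and Ridge; the data is invented and exact numbers will vary with the random seed:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.default_rng(0)
x = rng.normal(size=100)
# Two almost identical columns: an extreme case of multicollinearity
X = np.column_stack([x, x + rng.normal(scale=1e-6, size=100)])
y = 3 * x + rng.normal(scale=0.1, size=100)
print(LinearRegression().fit(X, y).coef_)  # weights can be huge with opposite signs
print(Ridge(alpha=1.0).fit(X, y).coef_)    # weights are small and split roughly evenly (~1.5 each)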
3. The Alpha ($\alpha$) Trade-off
Choosing the right $\alpha$ is a balancing act between Bias and Variance (see the sketch after this list):
- $\alpha = 0$: Equivalent to standard Linear Regression (High variance, Low bias).
- As $\alpha \to \infty$: The penalty dominates. Coefficients approach zero, and the model becomes a flat line (Low variance, High bias).
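A quick way to see the trade-off is to fit Ridge over a range of alphas on synthetic data and watch the coefficients shrink; the data below is invented purely for illustration:

import numpy as np
from sklearn.linear_model import Ridge
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=100)
for alpha in [0.01, 1, 10, 100, 1000]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(alpha, np.round(coefs, 2))  # coefficients move toward zero as alpha grows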
4. Implementation with Scikit-Learn
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
# 1. Scaling is MANDATORY for Ridge
# Because the penalty is based on the size of coefficients,
# features with larger scales will be penalized unfairly.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform the test set with the same fitted scaler
# 2. Initialize and Train
# alpha=1.0 is the default
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
# 3. Predict
y_pred = ridge.predict(X_test_scaled)
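To gauge how well the regularized model generalizes, you could follow up with an error metric on the held-out set (this assumes a y_test array matching X_test):

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))  # lower is better; compare against plain LinearRegression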
5. Ridge vs. Lasso: A Summary
| Feature | Ridge Regression (L2) | Lasso Regression (L1) |
|---|---|---|
| Penalty Term | $\alpha \sum_{j} \beta_j^2$ | $\alpha \sum_{j} \lvert \beta_j \rvert$ |
| Mathematical Goal | Penalizes the sum of the squared weights. | Penalizes the sum of the absolute values of the weights. |
| Coefficient Shrinkage | Shrinks coefficients asymptotically toward zero, but they rarely reach exactly zero. | Can shrink coefficients exactly to zero, effectively removing the feature. |
| Feature Selection | No. Keeps all predictors in the final model, though some may have tiny weights. | Yes. Acts as an automated feature selector by discarding unimportant variables. |
| Model Complexity | Produces a dense model (uses all features). | Produces a sparse model (uses a subset of features). |
| Ideal Scenario | When you have many features that all contribute a small amount to the output. | When you have many features, but only a few are actually significant. |
| Handling Correlated Data | Very stable; handles multicollinearity by distributing weights across correlated features. | Less stable; if features are highly correlated, it may randomly pick one and zero out the others. |
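The sparsity difference in the table is easy to verify on synthetic data; in this rough sketch (invented data, arbitrary alphas), Lasso drives the irrelevant coefficients exactly to zero while Ridge only shrinks them:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only the first two features actually matter
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(np.round(Ridge(alpha=1.0).fit(X, y).coef_, 3))  # all 8 weights non-zero, irrelevant ones tiny
print(np.round(Lasso(alpha=0.5).fit(X, y).coef_, 3))  # irrelevant weights are exactly 0.0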
6. RidgeCV: Finding the Best Alpha
Finding the perfect $\alpha$ manually is tedious. Scikit-Learn provides RidgeCV, which uses built-in cross-validation to find the optimal alpha for your specific dataset automatically.
from sklearn.linear_model import RidgeCV
# Define a list of alphas to test
alphas = [0.1, 1.0, 10.0, 100.0]
# RidgeCV finds the best one automatically
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Alpha: {ridge_cv.alpha_}")
References for More Details
- Scikit-Learn: Linear Models: Technical details on the solvers used (like 'cholesky' or 'sag').