
Ridge and Lasso Regression

Learn about ridge and lasso regression, how they compare, and why the intersection of their penalty contours with the MSE contours matters.

In the previous lesson, we learned that regularization helps control the bias–variance trade-off and prevents overfitting by adding a penalty on large weights. Now we will look at the two most widely used regularization methods:

  • Ridge (L2 regularization)

  • Lasso (L1 regularization)

Both techniques shrink the model’s weights, but the type of penalty they use leads to very different results. Understanding this difference is essential: Lasso can eliminate features entirely by driving their weights to exactly zero, while Ridge only shrinks them toward zero, and this behavior comes purely from the geometry of the penalties.

Ridge and Lasso objectives

Both Ridge and Lasso regression are special forms of regularized linear regression. They use the simplest model type (linear model) and the standard way to measure error (squared loss), differing only in their regularization penalty.
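To make “same model, same loss, different penalty” concrete, here is a minimal NumPy sketch of the two objectives. The function names, the use of the mean squared error, the penalty strength `lam`, and the fact that the intercept is not treated specially are assumptions for illustration; the exact objectives appear later in the lesson.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """Squared loss plus an L2 penalty (sum of squared weights)."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)

def lasso_objective(w, X, y, lam):
    """Squared loss plus an L1 penalty (sum of absolute weights)."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(np.abs(w))

# Tiny synthetic check: both objectives share the loss term and differ only in the penalty.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=50)
w = np.array([0.5, 0.5, 0.5])
print(ridge_objective(w, X, y, lam=0.1), lasso_objective(w, X, y, lam=0.1))
```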

The core model and loss function

Before introducing the penalty, we must define the model that makes a prediction and the loss function that measures the error.

Linear model ($f_{\mathbf{w}}$)

A linear model assumes the output ($\hat{y}_i$, the prediction) is a simple, weighted sum of the inputs ($x_i$). The goal is to find the best set of weights ($\mathbf{w}$) that connect the inputs to the output.

  • We have $n$ training examples, $D = \{(\mathbf{x}_i, y_i) \mid 1 \le i \le n\}$. Each input $\mathbf{x}_i$ has $d$ features.
  • The model expression:

$$f_{\mathbf{w}}(\mathbf{x}_i) = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id}$$

  • $w_0$ is the intercept (or bias).
  • $w_1$ to $w_d$ are the slopes or feature weights.
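As a quick numerical check of the formula, the sketch below computes the prediction for one example with $d = 3$ features; the specific values are made up for illustration.

```python
import numpy as np

# One training input with d = 3 features (toy values).
x_i = np.array([2.0, -1.0, 0.5])

# Intercept w0 and feature weights w1..wd.
w0 = 1.5
w = np.array([0.8, -0.3, 2.0])

# f_w(x_i) = w0 + w1*x_i1 + w2*x_i2 + w3*x_i3
y_hat = w0 + w @ x_i
print(y_hat)  # 1.5 + 1.6 + 0.3 + 1.0 = 4.4
```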

To simplify the math, we often combine $w_0$ with the other weights by adding a constant $1$ to the start of the feature vector: $\hat{\mathbf{x}}_i = (1, x_{i1}, \dots, x_{id})$ ...
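Continuing the toy example above, this sketch shows the bias-absorption trick: prepend a $1$ to the features, fold $w_0$ into the weight vector, and the prediction becomes a single dot product.

```python
import numpy as np

x_i = np.array([2.0, -1.0, 0.5])
w0 = 1.5
w = np.array([0.8, -0.3, 2.0])

# Prepend a constant 1 to the features and fold w0 into the weight vector.
x_aug = np.concatenate(([1.0], x_i))   # (1, x_i1, ..., x_id)
w_aug = np.concatenate(([w0], w))      # (w0, w1, ..., wd)

# A single dot product now gives the same prediction as w0 + w @ x_i.
assert np.isclose(w_aug @ x_aug, w0 + w @ x_i)
```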