Regularization (Lasso, Ridge, and ElasticNet Regression)

Learn about regularization, a technique that helps us deal with overfitting in Machine Learning models.


Overfitting describes a model that performs well on the training dataset but fails to generalize to unseen (test) data. Such a model is said to suffer from high variance. Overfitting on the training data can be illustrated as:

$J(w) \approx 0$

In other words, our predicted values are so close to the actual values that the training cost goes to zero: the model has effectively memorized the training data.
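To make this concrete, here is a minimal sketch of an overfitted model. The noisy sine data, the sample sizes, and the degree-7 polynomial are all assumptions chosen for illustration; they are not part of the lesson. With 8 training points and 8 polynomial coefficients, the fit can pass through every training point, so the training cost is close to zero while the cost on fresh data is noticeably larger.

```python
import numpy as np

def cost(y_hat, y):
    """Squared-error cost J = 1/(2m) * sum((y_hat - y)^2)."""
    return np.sum((y_hat - y) ** 2) / (2 * len(y))

rng = np.random.default_rng(0)

def noisy_sine(x):
    # Hypothetical data-generating process: a sine wave plus Gaussian noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Small training set and a larger, independent test set.
x_train = np.linspace(0.0, 1.0, 8)
y_train = noisy_sine(x_train)
x_test = rng.uniform(0.0, 1.0, 50)
y_test = noisy_sine(x_test)

# A degree-7 polynomial has 8 coefficients, enough to interpolate
# all 8 training points, noise included.
coeffs = np.polyfit(x_train, y_train, deg=7)

train_cost = cost(np.polyval(coeffs, x_train), y_train)
test_cost = cost(np.polyval(coeffs, x_test), y_test)

print(f"training cost: {train_cost:.2e}")
print(f"test cost:     {test_cost:.2e}")
```

The gap between the two printed costs is the symptom described above: low error on the data the model has seen, higher error on data it has not.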

How high variance (overfitting) can be reduced

  • The first strategy is to look for more training data so that the data has more variety in it.

  • Regularization, which is the focus of this part of the lesson, is also used to tackle overfitting.

  • Employ good Feature Selection techniques.

  • There are also some techniques specific to Deep Learning for reducing high variance.

Next, we look at how different regularization techniques are used to overcome overfitting.

Ridge Regression

The following shows how the cost function is modified in Ridge Regression, also known as L2 regularization:

$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}(\hat{y}^i - y^i)^2 + \lambda \sum_{j=1}^{n} w_j^2\right]$

  • In Ridge Regression, we minimize the above function.

  • $\lambda$ is called the regularization parameter.

  • Choosing too high a value of $\lambda$ drives the parameters $(w_1, w_2, \ldots)$ toward very small values, resulting in underfitting (also called high bias): the model won't perform well even on the training dataset. Notice that the parameter $w_0$ is not included in the penalty, so its value is not affected by the regularization.

  • Choosing too small a value of $\lambda$ gives the term $\lambda \sum_{j=1}^{n} w_j^2$ a negligible effect on the parameters $(w_1, w_2, \ldots)$, so the model reduces to ordinary Linear Regression. Choosing $\lambda$ is therefore part of hyperparameter optimization.
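The effect of the regularization parameter can be sketched in plain NumPy. The helper names `ridge_cost` and `ridge_fit`, the synthetic data, and the chosen values of the regularization parameter are assumptions made for this illustration. The closed-form solution used in `ridge_fit` is the standard result of setting the gradient of the cost to zero; it is not derived in this lesson. The first column of `X` is all ones, so `w[0]` plays the role of the unpenalized intercept.

```python
import numpy as np

def ridge_cost(X, y, w, lam):
    """J(w) = 1/(2m) * [sum of squared errors + lam * sum_{j>=1} w_j^2]."""
    m = len(y)
    residuals = X @ w - y
    penalty = lam * np.sum(w[1:] ** 2)  # w[0] (the intercept) is not penalized
    return (residuals @ residuals + penalty) / (2 * m)

def ridge_fit(X, y, lam):
    """Minimize ridge_cost by solving (X^T X + lam * D) w = X^T y."""
    D = np.eye(X.shape[1])
    D[0, 0] = 0.0  # leave the intercept unregularized
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# Hypothetical synthetic data: three features with known weights plus noise.
rng = np.random.default_rng(0)
m = 50
features = rng.normal(size=(m, 3))
X = np.column_stack([np.ones(m), features])  # column of ones for w_0
y = 4.0 + features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=m)

for lam in (0.0, 1.0, 1000.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:>7}: w_0={w[0]:.3f}, "
          f"norm of (w_1..w_n)={np.linalg.norm(w[1:]):.3f}, "
          f"J(w)={ridge_cost(X, y, w, lam):.4f}")
```

With the parameter set to zero the fit coincides with ordinary least squares, and as it grows the penalized weights shrink while the intercept stays free to absorb the mean of the target.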

Ridge Regression in Scikit Learn

The Ridge class is used to build a Ridge Regression model.
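As a minimal sketch of how this might look, the example below fits `Ridge` from `sklearn.linear_model` next to a plain `LinearRegression` on synthetic data. The data and the value `alpha=10.0` are assumptions made for illustration. In scikit-learn the regularization strength is passed as `alpha`, and the intercept is fitted separately and not penalized by default.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical synthetic data: three features with known weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Unregularized baseline and a Ridge model with an illustrative alpha.
linear = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("LinearRegression coefficients:", linear.coef_, "intercept:", linear.intercept_)
print("Ridge coefficients:           ", ridge.coef_, "intercept:", ridge.intercept_)
```

Comparing the two printed coefficient vectors shows the shrinkage that the penalty term introduces. In practice, `alpha` would be tuned as a hyperparameter rather than fixed by hand.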
