Trusted answers to developer questions

What is weight decay?

Muhammad Nabeel

Weight decay is a widely used type of regularization. It is also known as $l_2$ regularization. But before we dive into weight decay, let's understand why we need regularization in the first place.

When training a model, we often run into the problem of overfitting, where the model performs very well on the training data but fails to generalize well enough to unseen data to be practically useful.

There are many different ways to counter overfitting. A common misconception is that the only fix is to increase the number of samples in the dataset. This helps in some scenarios, but people often waste a lot of resources collecting new data when a simpler remedy exists. Let's take an example of polynomial fitting to understand.

[Figure: two polynomial fits to the same data, panels (a) and (b)]

Evidently, the fit in figure (a) generalizes well, while the fit in figure (b) gives better results on the training data. Let's say that figure (a) is fit using the following hypothesis function:
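The exact degree of the fit is not stated here. As an illustrative assumption, a low-degree polynomial such as a quadratic would be:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$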

Here, $h_{\theta}$ is the hypothesis function with $\theta$ as the model parameters and $x$ as the inputs. Similarly, the hypothesis function for figure (b) would be:
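As an illustrative assumption (the exact degrees are not stated), if figure (a) is a quadratic in $x$, a higher-degree fit such as a quartic adds two more terms:

$$h_{\theta}(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$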

Notice that the only difference is that the model in figure (b) has more features (higher-order terms), which makes it harder for the model to generalize. This can be addressed by setting the parameters of the extra features to zero. The same reasoning applies to logistic regression.

[Figure: the same comparison for logistic regression, panels (a) and (b)]

Note: Here, we detected overfitting by plotting the hypothesis. This is not possible in many other problems, where there can be hundreds of features and the hypothesis cannot be plotted.
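When plotting is not an option, one common alternative is to compare the error on the training data with the error on held-out data. The sketch below is a dependency-free illustration under assumed settings: the synthetic quadratic data, the noise level, the degrees 2 and 6, and all function names are hypothetical choices, not part of the original example.

```python
import random

def solve(A, b):
    # Gaussian elimination with partial pivoting for the square system A t = b.
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    theta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * theta[j] for j in range(i + 1, n))
        theta[i] = (M[i][n] - tail) / M[i][i]
    return theta

def fit_polynomial(xs, ys, degree):
    # Least squares via the normal equations (X^T X) theta = X^T y,
    # with polynomial features 1, x, x^2, ..., x^degree.
    X = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

def mse(theta, xs, ys):
    # Mean squared error of the polynomial with coefficients theta.
    preds = [sum(t * x ** p for p, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def make_data(n, rng):
    # Assumed ground truth: a noisy quadratic.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x - 3.0 * x ** 2 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(12, rng)
x_val, y_val = make_data(200, rng)

errors = {}
for degree in (2, 6):
    theta = fit_polynomial(x_train, y_train, degree)
    errors[degree] = (mse(theta, x_train, y_train), mse(theta, x_val, y_val))
    print(f"degree={degree} train_mse={errors[degree][0]:.4f} val_mse={errors[degree][1]:.4f}")
```

The higher-degree fit can never have a larger training error than the lower-degree one, because it contains the simpler model as a special case. A validation error that is clearly worse than the training error is the usual sign of overfitting.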

Weight decay

One way to counter the problem of overfitting would be to simply remove the unneeded features. In weight decay, however, we keep all the features and instead penalize their parameters so that they are pushed close to zero. Let's take the cost function for regression and add the regularization term to it.
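A common way to write this regularized cost is shown below. The $\frac{1}{2m}$ scaling and the superscript notation $x^{(i)}, y^{(i)}$ for the $i$-th training example are conventional choices and assumptions here:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\theta_j^2\right]$$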

Here, $J(\theta)$ is the cost function with $n$ parameters. The term $\lambda$ is called the regularization parameter and $m$ is the total number of training examples. The choice of $\lambda$ determines how strongly we penalize the parameters. If it is too large, the model will underfit; if it is too small, the penalty has little effect and the model can still overfit.

Note: By convention, we do not penalize the parameter $\theta_0$. In practice, this makes little to no difference.
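The effect of $\lambda$ can be sketched with plain gradient descent on a regularized squared-error cost. This is a minimal illustration, not a reference implementation: the $\frac{1}{2m}$ scaling, the synthetic linear data, the degree-5 polynomial features, the learning rate, and all function names are assumptions made for the example.

```python
import random

def predict(theta, row):
    return sum(t * v for t, v in zip(theta, row))

def cost(theta, X, y, lam):
    # J(theta) = (1 / 2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]
    m = len(y)
    sq_err = sum((predict(theta, row) - yi) ** 2 for row, yi in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is not penalized
    return (sq_err + penalty) / (2 * m)

def gradient_step(theta, X, y, lam, lr):
    # One gradient descent step on the regularized cost.
    m = len(y)
    errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        data_grad = sum(e * row[j] for e, row in zip(errors, X)) / m
        # The penalty multiplies each weight by a factor below 1 on every step.
        decay = 1.0 if j == 0 else 1.0 - lr * lam / m
        new_theta.append(t * decay - lr * data_grad)
    return new_theta

def fit(X, y, lam, lr=0.1, steps=2000):
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        theta = gradient_step(theta, X, y, lam, lr)
    return theta

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(30)]
y = [1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in xs]  # assumed linear ground truth
X = [[x ** p for p in range(6)] for x in xs]           # features 1, x, ..., x^5

sizes = {}
for lam in (0.0, 1.0, 100.0):
    theta = fit(X, y, lam)
    sizes[lam] = sum(t ** 2 for t in theta[1:]) ** 0.5
    print(f"lambda={lam} weight_norm={sizes[lam]:.3f} data_cost={cost(theta, X, y, 0.0):.4f}")
```

Written this way, each update first shrinks the weight by the factor `1 - lr * lam / m` and then applies the usual data gradient, which is where the name "weight decay" comes from. A very large `lam` drives the weights toward zero and the fit toward a constant, which corresponds to underfitting.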

Similarly, the cost function for logistic regression with weight decay would be:
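A standard form, assuming the usual cross-entropy cost for logistic regression with labels $y^{(i)} \in \{0, 1\}$ and the same $\frac{\lambda}{2m}$ scaling of the penalty as in regression, is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_{\theta}(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$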

Comparison with $l_1$ regularization

Since weight decay is said to be $l_2$ regularization, you may be wondering what $l_1$ regularization is like. Let's briefly look at $l_1$ regularization for regression.
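Assuming the same squared-error cost and $\frac{1}{2m}$ scaling as a typical $l_2$-regularized regression cost, with $m$ training examples, $n$ parameters, and regularization parameter $\lambda$, the $l_1$ version can be written as:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n}\left|\theta_j\right|\right]$$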

The only difference from the $l_2$ version is that we take the absolute value of the parameters instead of squaring them. The same goes for logistic regression.

The difference between $l_1$ regularization and $l_2$ regularization is that $l_1$ regularization performs feature selection: only the weights of the useful features stay nonzero, and the rest are driven to exactly zero. In $l_2$ regularization, by contrast, the parameters are pushed close to zero but not exactly to zero. The key benefits of using $l_2$ regularization are as follows:

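  • The $l_2$ penalty is differentiable everywhere, so its derivative is easier to calculate. The $l_1$ penalty is not differentiable at zero.
  • $l_2$ regularization often gives better predictive performance, although this depends on the problem.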
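The contrast between exact zeros under $l_1$ and small nonzero values under $l_2$ can be seen in a simplified one-parameter setting. The sketch below assumes the data term for each parameter is $\frac{1}{2}(\theta - z)^2$, where $z$ is the unregularized estimate. The values of `z` and `lam` are made up for illustration, and the function names are hypothetical.

```python
def l1_shrink(z, lam):
    # Minimizer of 0.5 * (t - z)**2 + lam * abs(t): soft-thresholding.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small estimates are set exactly to zero

def l2_shrink(z, lam):
    # Minimizer of 0.5 * (t - z)**2 + 0.5 * lam * t**2: proportional shrinkage.
    return z / (1.0 + lam)

unregularized = [3.0, -1.5, 0.4, -0.2, 0.05]  # assumed unregularized estimates
lam = 0.5

l1 = [l1_shrink(z, lam) for z in unregularized]
l2 = [l2_shrink(z, lam) for z in unregularized]

print("l1:", l1)
print("l2:", [round(v, 4) for v in l2])
```

Under these assumptions, every estimate with magnitude at most `lam` becomes exactly zero with the $l_1$ penalty, while the $l_2$ penalty scales all estimates by the same factor and leaves none of them exactly zero.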

Question

$L_1$ and $L_2$ regularization are applied to the model during training. What regularization technique could be applied to the data instead?

Copyright ©2022 Educative, Inc. All rights reserved
