
Muhammad Nabeel


**Weight decay** is a widely used type of regularization. It is also known as $l_2$ regularization or ridge regression.

When training a model, we often run into the problem of overfitting: the model performs perfectly on the training data but fails to generalize well enough to be practically useful.

There are many ways to counter overfitting. A common misconception is that increasing the number of samples in the dataset will always fix it. This may help in some scenarios, but people often waste a lot of resources collecting new data. Let's take a polynomial fitting example to understand why.

Evidently, the figure on the left generalizes well, while the figure on the right gives better results on the training data. Let's say that figure (a) fits using the following hypothesis function:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2$$

while figure (b) fits a higher-order hypothesis such as:

$$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$

Here, $x$ is the input feature, $h_\theta(x)$ is the model's prediction, and $\theta_0, \theta_1, \ldots$ are the parameters learned during training.

Notice that the only difference is that the model in figure (b) has more features, which makes it harder for the model to generalize. The overfitting can be removed simply by driving those extra parameters to zero. The same idea applies to logistic regression as well.
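The effect described above can be sketched numerically. In this minimal example (the data is made up for illustration), we fit a low-degree and a high-degree polynomial to a few noisy samples of a quadratic and compare training error against error on a held-out grid:

```python
import numpy as np

# Made-up data: a few noisy samples of the quadratic y = x^2.
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 8)
y_train = x_train**2 + rng.normal(0, 0.05, size=x_train.shape)
x_test = np.linspace(-1, 1, 100)  # held-out grid from the true curve
y_test = x_test**2

simple = np.polyfit(x_train, y_train, deg=2)    # like figure (a)
flexible = np.polyfit(x_train, y_train, deg=6)  # like figure (b): extra features

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# The flexible model always fits the training data at least as well,
# but that says nothing about how it behaves between the samples.
print("train:", mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
print("test: ", mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
```

With nested least-squares models, the higher-degree fit can never have a larger training error, which is exactly why training error alone is a misleading measure of model quality.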

Note: Here we detected the overfitting by plotting the hypothesis. This may not be possible in other problems, since there can be hundreds of features, and plotting in such high dimensions is impossible.

One way to counter the problem of overfitting would be to just remove the unneeded features. With weight decay, however, we instead penalize the parameters corresponding to those features, pushing them close to zero. Let's take the cost function for linear regression and add the regularization term to it:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

Here, $m$ is the number of training examples, $n$ is the number of features, $(x^{(i)}, y^{(i)})$ is the $i$-th training example, and $\lambda$ is the regularization parameter that controls how strongly large weights are penalized.

Note: By convention, we do not penalize the bias parameter $\theta_0$; penalizing it makes little to no difference in practice.
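As a sketch, the regularized cost above can be computed directly in NumPy (the data here is made up for illustration, and the first column of `X` is the constant bias feature):

```python
import numpy as np

# Regularized (weight-decay) cost for linear regression, following the
# convention of not penalizing the bias parameter theta_0.
def cost(theta, X, y, lam):
    m = len(y)
    residuals = X @ theta - y               # h_theta(x^(i)) - y^(i) per sample
    penalty = lam * np.sum(theta[1:] ** 2)  # skip theta_0
    return (np.sum(residuals ** 2) + penalty) / (2 * m)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias feature first
y = np.array([0.0, 1.0, 2.0])
theta = np.array([0.0, 1.0])  # fits the data exactly: residuals are zero

print(cost(theta, X, y, lam=0.0))  # 0.0 -- no residuals, no penalty
print(cost(theta, X, y, lam=3.0))  # 0.5 -- only the penalty term remains
```

With zero residuals, the whole cost comes from the penalty: $\lambda \theta_1^2 / (2m) = 3 \cdot 1 / 6 = 0.5$, which shows how the regularizer charges a price for nonzero weights even on a perfect fit.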

Similarly, the cost function for logistic regression with weight decay would be:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
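A minimal sketch of this logistic-regression cost (again with made-up data) looks like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Regularized logistic-regression cost; theta_0 is not penalized.
def logistic_cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every sample
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    return cross_entropy + lam / (2 * m) * np.sum(theta[1:] ** 2)

X = np.array([[1.0, -2.0], [1.0, 0.5], [1.0, 3.0]])  # bias feature first
y = np.array([0.0, 1.0, 1.0])

# With all-zero parameters, h = 0.5 everywhere and the penalty vanishes,
# so the cost is exactly log(2) regardless of lambda.
print(logistic_cost(np.zeros(2), X, y, lam=10.0))  # ~0.6931
```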

Since weight decay is said to be $l_2$ regularization, you may wonder about its counterpart, $l_1$ regularization (also known as lasso). Its cost function for linear regression is:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\left|\theta_j\right|\right]$$

The only difference from the equation before is that we take the **absolute magnitude** of the parameters instead of squaring them. The same goes for logistic regression.

The differences between $l_1$ and $l_2$ regularization include:

- It is easier to calculate the derivative of $l_2$ regularization, since the $l_1$ penalty is not differentiable at zero.
- The $l_2$ regularization usually gives better performance in prediction, while $l_1$ drives many weights to exactly zero, effectively performing feature selection.
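The contrast between the two penalties is easiest to see on the scalar problem $\min_\theta \frac{1}{2}(\theta - a)^2 + \text{penalty}(\theta)$, whose closed-form minimizers are standard results (the value of $a$ below is made up for illustration):

```python
import numpy as np

# Closed-form minimizers of 0.5*(theta - a)^2 + penalty(theta):
def l2_solution(a, lam):
    # penalty = (lam/2) * theta^2  ->  multiplicative shrinkage
    return a / (1.0 + lam)

def l1_solution(a, lam):
    # penalty = lam * |theta|  ->  soft-thresholding
    return np.sign(a) * max(abs(a) - lam, 0.0)

a = 0.3
print(l2_solution(a, lam=1.0))  # 0.15 -- shrunk, but never exactly zero
print(l1_solution(a, lam=1.0))  # 0.0  -- small weights are zeroed out
```

This is why $l_1$ produces sparse models: any weight whose magnitude falls below $\lambda$ is set exactly to zero, while $l_2$ only scales weights down.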

$L_1$ and $L_2$ regularization are techniques applied during model training. What could be an analogous regularization technique applied to the data itself?



Copyright ©2022 Educative, Inc. All rights reserved
