Overfitting and Underfitting
Explore the concepts of overfitting and underfitting in supervised learning. Learn to identify when a model is too simple or too complex, how these issues impact training and testing performance, and strategies to balance bias and variance for better generalization on unseen data.
In machine learning, the ultimate goal isn't just to memorize past data; it's to make accurate predictions on new, unseen data. If a model performs perfectly on the data it was trained on but fails when deployed in the real world, it’s useless.
The concepts of overfitting and underfitting describe two major failure modes when attempting to achieve this crucial ability, which we refer to as generalization.
What is overfitting?
Overfitting is a modeling error where the model learns the training data (including its accidental irregularities or noise) too well, failing to capture the broad, underlying pattern. This results in a model that performs exceptionally well on the training data but poorly on any new data.
We can relate this to rote learning in a student: if a student only memorizes the solution to a specific practice problem, they might ace that problem, but if the problem is slightly changed (new data), they fail because they missed the underlying mathematical principle (the pattern).
- Cause: We choose a model that is too flexible (has too many parameters) relative to the size and complexity of the training data. For example, a degree-10 polynomial is far more flexible than a degree-2 polynomial.
- Result: High performance on training data, but low performance (high error) on testing data.
The concept of model flexibility and its trade-off is often best seen through the lens of Polynomial Regression, where we use higher powers of the input feature x to fit a curve.
The illustration above shows the three stages as the complexity of the polynomial function increases:
| Stage | Complexity (Polynomial Degree) | Description | Loss on Data |
| --- | --- | --- | --- |
| Underfitting | Low (e.g., degree 1) | The line is too simple and misses the fundamental trend of the data points. | High training and testing loss |
| Good fit (Generalization) | Medium (e.g., degree 7) | The curve matches the main trend without following every minor point. | Low training and testing loss |
| Overfitting | High (e.g., degree 10) | The curve becomes erratic, bending sharply to pass exactly through every training point, including the noise. | Very low training loss, high testing loss |
Try it yourself
We will now implement a small model using Linear Regression and Polynomial Features to clearly visualize the effect of model complexity (polynomial degree) on training and testing error.
Our goal is to approximate a noisy sinusoidal function with a polynomial curve.
Import packages and define function
First, we import the necessary libraries and define the target function (a sinusoid).
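The original code listing is not reproduced here, so below is a minimal sketch consistent with the rest of the section. The target function is assumed to be `sin(x)`, since the later steps describe the data as sinusoidal:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def true_function(x):
    # The underlying pattern we want the model to recover.
    # The exact function is not shown in the text; sin(x) is assumed
    # because the section describes the data as sinusoidal.
    return np.sin(x)
```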
Generating and splitting data
We generate 100 synthetic data points and split them into two distinct sets: a training set (for the model to learn from) and a testing set (for evaluating the model’s performance on unseen data).
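A self-contained sketch of this step is shown below. The comments mark which line of the original listing each statement corresponds to; the noise scale (0.2) and the random seed are assumptions, not taken from the original:

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(0)  # assumed seed, for reproducibility

n_total = 100                                               # line 1: number of samples
x_total = np.linspace(0, 10, n_total)                       # line 2: evenly spaced x-coordinates
y_total = np.sin(x_total) + 0.2 * np.random.randn(n_total)  # line 3: true output plus Gaussian noise
X_total = x_total[:, np.newaxis]                            # line 4: reshape (100,) -> (100, 1)
x_train, x_test, y_train, y_test = train_test_split(        # line 5: 20 train / 80 test points
    X_total, y_total, test_size=0.8, random_state=0)
```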
The explanation for the code above is given as follows:
- Line 2: `np.linspace(0, 10, n_total)` creates evenly spaced x-coordinates between 0 and 10.
- Line 3: The true output is mixed with `np.random.randn(n_total)` (noise from a standard normal distribution) to create realistic, noisy data.
- Line 4: We use `np.newaxis` to turn a 1-D array into a 2-D column vector. This reshapes the 1-D array `x_total` (shape `(100,)`) into a 2-D column vector (shape `(100, 1)`), which is the format required by most ML models.
- Line 5: `train_test_split` divides the data into two parts: a training set (for learning) and a testing set (for unbiased evaluation). We use `test_size=0.8`, meaning only 20% of the data (20 points) is used for training, making it easier to see overfitting later.
Fitting the polynomial model
We now construct the model. We use make_pipeline to combine two steps into one: first, creating the polynomial features, and second, running the linear regression.
The make_pipeline is a utility that combines multiple steps into a single, streamlined estimator. Here, it ensures that when data is fed to the model, it is first transformed into polynomial features (e.g., a single input x is expanded into x, x², ..., xᵈ) before being passed to the LinearRegression model for learning.
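A sketch of this step (the data is regenerated so the snippet is self-contained; degree 7 is just one choice to try):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x_total = np.linspace(0, 10, 100)
y_total = np.sin(x_total) + 0.2 * np.random.randn(100)  # noisy sinusoid (noise scale assumed)
x_train, x_test, y_train, y_test = train_test_split(
    x_total[:, np.newaxis], y_total, test_size=0.8, random_state=0)

degree = 7  # model complexity knob: try 1 (underfit), 7 (good fit), 15 (overfit)

# The pipeline first expands x into [x, x^2, ..., x^degree],
# then fits ordinary least squares on those features.
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x_train, y_train)
```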
Computing loss and plotting
Finally, we calculate the MSE for both the training set and the crucial testing set and plot the results to visualize the model’s fit.
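A self-contained sketch of the evaluation-and-plotting step (the fitting code is repeated so it runs on its own; the plot styling and output filename are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(0)
x_total = np.linspace(0, 10, 100)
y_total = np.sin(x_total) + 0.2 * np.random.randn(100)
x_train, x_test, y_train, y_test = train_test_split(
    x_total[:, np.newaxis], y_total, test_size=0.8, random_state=0)

degree = 7
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x_train, y_train)

# MSE on both splits: a large gap between the two signals overfitting.
mse_train = mean_squared_error(y_train, model.predict(x_train))
mse_test = mean_squared_error(y_test, model.predict(x_test))
print(f"MSE (Train): {mse_train:.4f}   MSE (Test): {mse_test:.4f}")

# Plot the training data, the fitted curve, and the true function.
x_plot = np.linspace(0, 10, 300)[:, np.newaxis]
plt.scatter(x_train, y_train, label="training data")
plt.plot(x_plot, model.predict(x_plot), "r-", label=f"degree {degree} fit")
plt.plot(x_plot, np.sin(x_plot), "g--", label="true function")
plt.legend()
plt.savefig("polyfit.png")
```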
Putting it all together
Enter the degree of the polynomial you want to fit in the data:
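Since the interactive widget isn't available here, the sketch below wraps the whole experiment in a function you can call with different degrees (the noise scale and seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def fit_and_evaluate(degree, noise=0.2, seed=0):
    """Fit a polynomial of the given degree to noisy sin(x) data
    and return (training MSE, testing MSE)."""
    rng = np.random.RandomState(seed)
    x = np.linspace(0, 10, 100)
    y = np.sin(x) + noise * rng.randn(100)
    x_train, x_test, y_train, y_test = train_test_split(
        x[:, np.newaxis], y, test_size=0.8, random_state=seed)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    return (mean_squared_error(y_train, model.predict(x_train)),
            mean_squared_error(y_test, model.predict(x_test)))

for degree in (1, 7, 15):
    mse_train, mse_test = fit_and_evaluate(degree)
    print(f"degree {degree:2d}: train MSE = {mse_train:.4f}, test MSE = {mse_test:.4f}")
```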
- Try a small degree: You will see the curve is very stiff and misses the sinusoidal shape. Both MSE (Train) and MSE (Test) will be high. This is underfitting.
- Try a medium degree: The curve will follow the overall sinusoidal pattern well. Both MSE (Train) and MSE (Test) will be low and close to each other. This is the sweet spot for a good fit (generalization).
- Try a high degree (e.g., 11 or 15): The curve will wiggle drastically to hit every single training point. MSE (Train) will be very low, but MSE (Test) will be very high because the wiggles poorly represent the true underlying function. This is overfitting.
Generalization
Good generalization is what a model must aspire to achieve in order to avoid overfitting. Generalization is the model's ability to adapt to unseen data; we can think of it as the model's performance after deployment, when new data starts coming in.

Consider a face recognition-based access control system. During registration, a camera captures face images of an authorized person from different viewpoints; these images make up the training data. After deployment, the same person's face is captured again, likely producing an image that isn't identical to any image in the training data. The system should still recognize this new image despite small, novel variations in viewpoint or lighting conditions.
The following illustration highlights overfitting and generalization in a single frame:
In the illustration above, the blue dots represent the actual data points the model learns from.
The green curve shows good generalization: it captures the smooth underlying pattern without chasing noise, so it performs well on new data. The red curve shows overfitting: a model that bends too sharply to match every fluctuation in the training set.
Although it fits the training data perfectly, it memorizes noise and performs poorly on unseen data.
What is underfitting?
The other extreme is underfitting, where the model fails to perform well on training data. While overfitting makes the model fit too closely, underfitting is the model’s inability to grasp the relationship between input and output. In underfitting, the training and validation/testing errors are large.
In the image, the blue dots represent the actual training data, following a non-linear, U-shaped pattern. The red line shows an underfitted model that is too simple to capture this pattern, resulting in high error on both the training and testing data.
Underfitting vs. overfitting (Bias-variance tradeoff)
In machine learning, underfitting and overfitting are two sides of one of the most common problems a model can face. An ideal model is neither underfitted nor overfitted: a sweet spot exists where both the training and testing errors are low.
When a model is underfitted, it’s said to have high bias. When the model is overfitted, it has high variance.
- Bias measures the error introduced by approximating a real-world problem (which may be complex) with a simpler model (e.g., trying to fit a curve with a straight line). High bias leads to underfitting.
- Variance measures how much a model's predictions change with small differences in the training data. High variance indicates overfitting: the model fits the training data too closely but performs poorly on new, unseen data.
The goal is to achieve the lowest combined error by finding a balance between bias and variance. To clearly differentiate these two failure modes and understand the sweet spot, let’s look at a side-by-side comparison:
| Feature | Underfitting (High Bias) | Overfitting (High Variance) | Ideal Fit (Sweet Spot) |
| --- | --- | --- | --- |
| Model complexity | Too simple (e.g., linear) | Too complex (e.g., high-degree polynomial) | Just right |
| Training error | High | Very low | Low |
| Testing error | High | Very high | Lowest |
| Learning failure | Misses the fundamental pattern (underlying trend) | Learns the noise (accidental irregularities) | Learns the true pattern |
| Visual example | Straight line missing a curve | Wobbling curve hitting every point | Smooth curve fitting the trend |
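The variance column of this comparison can be made concrete with a small, illustrative experiment (all data-generation choices here are assumptions): refit the same model on many freshly drawn noisy datasets and watch how much its prediction at one fixed point moves around.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def prediction_spread(degree, x0=5.0, n_repeats=50):
    """Std-dev of the model's prediction at x0 across refits on fresh data."""
    preds = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 10, 20)                     # a small, freshly drawn training sample
        y = np.sin(x) + 0.2 * rng.standard_normal(20)  # fresh noise each time
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x[:, np.newaxis], y)
        preds.append(model.predict([[x0]])[0])
    return float(np.std(preds))

print("spread, degree 1 :", prediction_spread(1))   # low variance (but high bias)
print("spread, degree 12:", prediction_spread(12))  # typically a much larger spread
```

A high-variance model's answer at the same point changes drastically from one training sample to the next, which is exactly why it fails on unseen data.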
How to address overfitting and underfitting
Addressing overfitting and underfitting issues requires different strategies: improving model complexity, regularization, or adjusting the training data can help achieve better generalization. Selecting the right approach ensures the model balances bias and variance for optimal performance on unseen data.
| Issue | Techniques to Address |
| --- | --- |
| Underfitting | Increase model complexity (e.g., a higher polynomial degree), add more informative features, train longer, or reduce regularization. |
| Overfitting | Apply regularization, gather more training data, simplify the model, or use cross-validation and early stopping. |
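As an illustrative example of the regularization technique mentioned above, the sketch below compares an unregularized high-degree polynomial against a ridge-regularized one on the same noisy sinusoidal data; the degree, `alpha`, and scaling choices are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

np.random.seed(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.2 * np.random.randn(100)
x_train, x_test, y_train, y_test = train_test_split(
    x[:, np.newaxis], y, test_size=0.8, random_state=0)

degree = 12  # deliberately too flexible for only 20 training points
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree), StandardScaler(), Ridge(alpha=1.0))

for name, model in (("unregularized", plain), ("ridge (alpha=1)", ridge)):
    model.fit(x_train, y_train)
    mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"{name:>16}: test MSE = {mse:.4f}")
```

Ridge's penalty on large coefficients discourages the wild oscillations of the unregularized fit, trading a little bias for a large reduction in variance.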
Conclusion
The key lesson from overfitting and underfitting is that the goal of machine learning is not just to minimize training error, but to achieve strong generalization, the ability to perform well on unseen data.
Underfitting occurs when a model is too simple to capture the underlying pattern, leading to high errors on both training and testing data. Overfitting happens when a model is too complex, memorizing noise in the training set and performing poorly on new data.
The ideal model strikes a balance between bias and variance, capturing the true relationship while ignoring noise. Techniques like regularization help manage model complexity and prevent overfitting when training data is limited.