
Overfitting and Underfitting

Explore the concepts of overfitting and underfitting in supervised learning. Learn to identify when a model is too simple or too complex, how these issues impact training and testing performance, and strategies to balance bias and variance for better generalization on unseen data.

In machine learning, the ultimate goal isn't just to memorize past data; it's to make accurate predictions on new, unseen data. If a model performs perfectly on the data it was trained on but fails when deployed in the real world, it’s useless.

The concepts of overfitting and underfitting describe two major failure modes when attempting to achieve this crucial ability, which we refer to as generalization.

What is overfitting?

Overfitting is a modeling error where the model learns the training data (including its accidental irregularities or noise) too well, failing to capture the broad, underlying pattern. This results in a model that performs exceptionally well on the training data but poorly on any new data.

We can relate this to rote learning in a student: if a student only memorizes the solution to a specific practice problem, they might ace that problem, but if the problem is slightly changed (new data), they fail because they missed the underlying mathematical principle (the pattern).

  • Cause: We choose a model that is too flexible (has too many parameters) relative to the size and complexity of the training data. For example, a 10-degree polynomial is far more flexible than a 2-degree polynomial.

  • Result: High performance on training data, but low performance (high error) on testing data.

The concept of model flexibility and its trade-off is often best seen through the lens of Polynomial Regression, where we use higher powers of the input feature ($x, x^2, x^3, \dots$) to fit a curve.
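To see this flexibility concretely, here is a minimal numpy-only sketch (using numpy.polynomial.Polynomial, with data drawn from the $x \sin(x)$ target used later in this lesson): training error can only go down as the degree grows, which is exactly what makes high-degree fits prone to chasing noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = x * np.sin(x) + rng.normal(size=x.size)  # noisy samples of the target

# A more flexible (higher-degree) polynomial always drives training error lower
mse = {}
for degree in (1, 2, 10):
    p = Polynomial.fit(x, y, degree)        # least-squares polynomial fit
    mse[degree] = np.mean((y - p(x)) ** 2)  # training-set MSE
    print(f"degree {degree:2d}: training MSE {mse[degree]:.2f}")
```

Low training error alone says nothing about generalization; that is the point the rest of this lesson develops.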

The illustration above shows the three stages as the complexity of the polynomial function increases:

| Stage | Complexity (Polynomial Degree) | Description | Loss on Data |
| --- | --- | --- | --- |
| Underfitting | Low (e.g., degree 1) | The line is too simple and misses the fundamental trend of the data points. | High training and testing loss |
| Good fit (Generalization) | Medium (e.g., degree 7) | The curve matches the main trend without following every minor point. | Low training and testing loss |
| Overfitting | High (e.g., degree 10) | The curve becomes erratic, bending sharply to pass exactly through every training point, including the noise. | Very low training loss, high testing loss |

Try it yourself

We will now implement a small model using Linear Regression and Polynomial Features to clearly visualize the effect of model complexity (polynomial degree) on training and testing error.

Our goal is to approximate the function $f(x) = x \sin(x)$ with a polynomial curve.

Import packages and define function

First, we import the necessary libraries and define the target function $f(x) = x \sin(x)$.

Python 3.5
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# Keeping the seed value constant for reproducible results
np.random.seed(1)
# Function we are trying to approximate: f(x) = x * sin(x)
def f(x):
    return x * np.sin(x)

Generating and splitting data

We generate 100 synthetic data points and split them into two distinct sets: a training set (for the model to learn from) and a testing set (for evaluating the model’s performance on unseen data).

Python
n_total = 100
x_total = np.linspace(0, 10, n_total)
y = f(x_total) + np.random.randn(n_total)
X = x_total[:,np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)
idx = np.argsort(X_train[:,0])
X_train[:,0], y_train = X_train[idx,0], y_train[idx]
n_train, n_test = X_train.shape[0], X_test.shape[0]

The code above works as follows:

  • Line 2: np.linspace(0, 10, n_total) creates 100 evenly spaced $x$-coordinates between 0 and 10.

  • Line 3: The true output $f(x)$ is mixed with np.random.randn(n_total) (noise from a standard normal distribution) to create realistic, noisy data ($y$).

  • Line 4: np.newaxis reshapes the 1-D array x_total (shape: (100,)) into a 2-D column vector (shape: (100, 1)), the format most ML models in scikit-learn require.

  • Line 5: train_test_split divides the data into a training set (for learning) and a testing set (for unbiased evaluation). With test_size=0.8, only $20\%$ of the points (20 points) are used for training, which makes overfitting easier to see later.
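As a quick sanity check on the split described above (the targets here are placeholders; only the shapes matter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total = 100
X = np.linspace(0, 10, n_total)[:, np.newaxis]  # 2-D column vector, shape (100, 1)
y = np.random.randn(n_total)                    # placeholder targets for the shape check

# test_size=0.8 keeps only 20% of the 100 points for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)
print(X_train.shape, X_test.shape)  # (20, 1) (80, 1)
```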

Fitting the polynomial model

We now construct the model. We use make_pipeline to combine two steps into one: first, creating the polynomial features, and second, running the linear regression.

Python
model = make_pipeline(PolynomialFeatures(degree), LR())  # degree is chosen by the user in the full program
model.fit(X_train, y_train)

make_pipeline is a utility that chains multiple steps into a single estimator. Here, it ensures that when data is fed to the model, it is first transformed into polynomial features (e.g., $x$ becomes $1, x, x^2, x^3, \dots$) before being passed to the LinearRegression model for learning.
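For instance, PolynomialFeatures with degree=3 expands each single-feature sample into the columns $1, x, x^2, x^3$:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])         # two samples of a single feature x
poly = PolynomialFeatures(degree=3)  # expands x into 1, x, x^2, x^3
expanded = poly.fit_transform(X)
print(expanded)                      # rows: [1, 2, 4, 8] and [1, 3, 9, 27]
```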

Computing loss and plotting

Finally, we calculate the MSE for both the training set and the crucial testing set and plot the results to visualize the model’s fit.

Python
# Computing MSE for training and testing data
train_loss = np.sum((model.predict(X_train)-y_train)**2)/n_train
test_loss = np.sum((model.predict(X_test)-y_test)**2)/n_test
# Plotting
fig, ax = plt.subplots()
ax.plot(x_total, f(x_total), linewidth=1, label="ground truth", color = "green")
ax.scatter(X_train[:,0], y_train,label="training points", color = '#65b2ff', edgecolors='black')
ax.plot(X_train[:,0], model.predict(X_train), label=f"degree {degree}",linestyle='--', color = "darkorange")
plt.grid(color = 'b', linestyle = '--', linewidth = 0.3)
plt.xlabel(r'$x$',fontsize=20)
plt.ylabel(r'$\hat y$',fontsize=20)
plt.title(f'MSE(Train) = {train_loss:.2f} :: MSE(Test) = {test_loss:.2f}', fontsize=20)
ax.legend(loc="lower center")
ax.set_ylim(-20, 10)
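The manual sum-of-squares formula above is exactly what scikit-learn's mean_squared_error computes; a quick check on toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# Same formula as in the snippet: sum of squared errors divided by n
manual = np.sum((y_pred - y_true) ** 2) / y_true.size
library = mean_squared_error(y_true, y_pred)
print(manual, library)  # both equal (0.25 + 0.0 + 1.0) / 3
```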

Putting it all together

Enter the degree of the polynomial you want to fit to the data:

  • Try a small degree: You will see the curve is very stiff and misses the sinusoidal shape. Both MSE (Train) and MSE (Test) will be high. This is underfitting.

  • Try a medium degree: The curve will follow the overall sinusoidal pattern well. Both MSE (Train) and MSE (Test) will be low and close to each other. This is the sweet spot for a good fit (generalization).

  • Try a high degree (e.g., 11 or 15): The curve will wiggle drastically to hit every single training point. MSE (Train) will be very low, but MSE (Test) will be very high because the wiggles poorly represent the true underlying function. This is overfitting.
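Instead of typing degrees one at a time, the same experiment can be scripted as a sweep. The sketch below reuses the lesson's data-generation code; degrees 1, 7, and 15 are illustrative picks for the three regimes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(1)

def f(x):
    return x * np.sin(x)

X = np.linspace(0, 10, 100)[:, np.newaxis]
y = f(X[:, 0]) + np.random.randn(100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.8, random_state=42)

results = {}
for degree in (1, 7, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = np.mean((model.predict(X_tr) - y_tr) ** 2)  # training MSE
    te = np.mean((model.predict(X_te) - y_te) ** 2)  # testing MSE
    results[degree] = (tr, te)
    print(f"degree {degree:2d}: train MSE {tr:.2f}, test MSE {te:.2f}")
```

Degree 1 should show high error on both sets (underfitting), degree 7 low error on both (good fit), and degree 15 a very low training error with a much larger testing error (overfitting).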

Python 3.10.4
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as LR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
# Keeping the seed value constant to observe effect of model complexity
np.random.seed(1)
# Target function for the synthetic data: f(x) = x * sin(x)
def f(x):
    return x * np.sin(x)
# Reading the degree value; try increasing or decreasing it to see the effect
degree = int(input())
# Generating 100 synthetic data points, which have sinusoidal distribution, and making its splits
n_total = 100
x_total = np.linspace(0, 10, n_total)
y = f(x_total) + np.random.randn(n_total)
X = x_total[:,np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)
idx = np.argsort(X_train[:,0])
X_train[:,0], y_train = X_train[idx,0], y_train[idx]
n_train, n_test = X_train.shape[0], X_test.shape[0]
# Constructing an automated workflow for training the model
model = make_pipeline(PolynomialFeatures(degree), LR())
model.fit(X_train, y_train)
# Computing MSE for training and testing data
train_loss = np.sum((model.predict(X_train)-y_train)**2)/n_train
test_loss = np.sum((model.predict(X_test)-y_test)**2)/n_test
# Plotting
fig, ax = plt.subplots()
ax.plot(x_total, f(x_total), linewidth=1, label="ground truth", color = "green")
ax.scatter(X_train[:,0], y_train,label="training points", color = '#65b2ff', edgecolors='black')
ax.plot(X_train[:,0], model.predict(X_train), label=f"degree {degree}",linestyle='--', color = "darkorange")
plt.grid(color = 'b', linestyle = '--', linewidth = 0.3)
plt.xlabel(r'$x$',fontsize=20)
plt.ylabel(r'$\hat y$',fontsize=20)
plt.title(f'MSE(Train) = {train_loss:.2f} :: MSE(Test) = {test_loss:.2f}', fontsize=20)
ax.legend(loc="lower center")
ax.set_ylim(-20, 10)
plt.show()

Generalization

To avoid overfitting, a model must achieve good generalization: the ability to adapt to unseen data. We can think of generalization as the model's performance after deployment, when new data comes in.

Consider a face recognition-based access control system. During registration, a camera captures face images of an authorized person from different viewpoints; these make up the training data. After deployment, the same person's face is captured again, likely producing an image that isn't identical to any in the training data. The system should still recognize the new face image despite small, novel variations in viewpoint or lighting.

The following illustration highlights overfitting and generalization in a single frame:

In the illustration above, the blue dots represent the actual data points the model learns from.

The green curve shows good generalization: it captures the smooth underlying pattern without chasing noise, so it performs well on new data. The red curve shows overfitting: a model that bends too sharply to match every fluctuation in the training set.

Although it fits the training data perfectly, it memorizes noise and performs poorly on unseen data.

What is underfitting?

The other extreme is underfitting, where the model fails to perform well even on the training data. While overfitting makes the model fit too closely, underfitting is the model's inability to grasp the relationship between input and output; both the training and validation/testing errors are large.

In the image, the blue dots represent the actual training data, following a non-linear, U-shaped pattern. The red line shows an underfitted model that is too simple to capture this pattern, resulting in high bias (meaning the model is too simple to capture the underlying patterns in the data) and poor performance on both training and test data. The green curve shows a model with the right level of complexity: it captures the underlying U-shaped trend, learning the true relationship and performing well on both seen and unseen data.

Underfitting vs. overfitting (Bias-variance tradeoff)

In machine learning, a model commonly fails in one of two ways: underfitting or overfitting. An ideal model is neither; a sweet spot exists where both training and testing errors are low.

When a model is underfitted, it’s said to have high bias. When the model is overfitted, it has high variance.

  • Bias measures the error introduced by approximating a real-world problem (which may be complex) with a simpler model (e.g., trying to fit a curve with a straight line). High bias leads to underfitting.

  • Variance measures how much a model’s predictions change with small differences in the training data. High variance indicates overfitting: the model fits the training data too closely but performs poorly on new, unseen data.

The goal is to achieve the lowest combined error by finding a balance between bias and variance. To clearly differentiate these two failure modes and understand the sweet spot, let’s look at a side-by-side comparison:

| Feature | Underfitting (High Bias) | Overfitting (High Variance) | Ideal Fit (Sweet Spot) |
| --- | --- | --- | --- |
| Model complexity | Too simple (e.g., linear) | Too complex (e.g., high-degree polynomial) | Just right |
| Training error | High | Very low | Low |
| Testing error | High | Very high | Lowest |
| Learning failure | Misses the fundamental pattern (underlying trend) | Learns the noise (accidental irregularities) | Learns the true pattern |
| Visual example | Straight line missing a curve | Wobbling curve hitting every point | Smooth curve fitting the trend |
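The variance half of the trade-off can be made concrete: refit the same model on freshly resampled training sets and watch how much its prediction at one fixed point moves. A sketch, assuming the lesson's $x \sin(x)$ target and numpy-only polynomial fits:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
x0 = 5.0  # fixed query point at which we track the model's prediction

def predictions_at(degree, n_repeats=200):
    """Refit on freshly resampled noisy training sets; return predictions at x0."""
    preds = []
    for _ in range(n_repeats):
        y = x * np.sin(x) + rng.normal(size=x.size)  # new noise, same true function
        preds.append(Polynomial.fit(x, y, degree)(x0))
    return np.array(preds)

for degree in (1, 10):
    var = predictions_at(degree).var()
    print(f"degree {degree:2d}: prediction variance {var:.3f}")
```

The simple (high-bias) model barely reacts to the resampled noise, while the flexible (high-variance) model's prediction swings far more from one training set to the next.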

How to address overfitting and underfitting

Addressing overfitting and underfitting issues requires different strategies: improving model complexity, regularization, or adjusting the training data can help achieve better generalization. Selecting the right approach ensures the model balances bias and variance for optimal performance on unseen data.

| Issue | Techniques to Address |
| --- | --- |
| Underfitting | Increase model complexity (more features, deeper trees, larger networks); reduce regularization; feature engineering or adding relevant data |
| Overfitting | Apply regularization (L1/L2, dropout); reduce model complexity; use more training data or data augmentation; early stopping; cross-validation |

Conclusion

The key lesson from overfitting and underfitting is that the goal of machine learning is not just to minimize training error, but to achieve strong generalization, the ability to perform well on unseen data.

Underfitting occurs when a model is too simple to capture the underlying pattern, leading to high errors on both training and testing data. Overfitting happens when a model is too complex, memorizing noise in the training set and performing poorly on new data.

The ideal model strikes a balance between bias and variance, capturing the true relationship while ignoring noise. Techniques like regularization help manage model complexity and prevent overfitting when training data is limited.