
The Threat of Overfitting

Explore the concept of overfitting in supervised learning and understand why using separate data for training and testing is crucial. Learn how overfitting affects neural network accuracy and how the blind test rule helps maintain realistic performance estimates on unseen data.

Soon enough, we’ll get to tune our neural network and make it as accurate as possible. Before we do that, however, we need a reliable test to measure that accuracy. As it turns out, ML testing comes with a subtly counterintuitive hurdle that can easily complicate things.

Side note: This short chapter tells us where that testing trapdoor is, and how to step around it.

Overfitting

Since the first part of this course, we’ve been using two separate sets of examples: one for training our algorithms, and one for testing them. Let’s refresh our memory: why do we not use the same examples for both training and testing?

Let’s use a metaphor to answer that question. Imagine we teach basic math to a class of young kids. We’ve already prepared 60 multiple-answer multiplication quizzes. We plan to assign most of those quizzes as homework. We also plan to select 10 quizzes for a final test to check how well the kids are learning.

We want to split the quizzes into two groups: “homework” and “test.” Now, we have two options: either we assign all 60 quizzes to “homework” and then reuse 10 of them in “test,” or we split the quizzes, assigning 50 of them to “homework” and the remaining 10 to “test.” Which option would we pick?

Most probably, we would go for the second option, for one reason: we do not want the kids to memorize multiplications. We want them to understand the rules of multiplication. Therefore, we would rather test their knowledge on quizzes that they have not seen before. Otherwise, they might get the right answers simply because they remember those answers from their homework.

The same reasoning applies when we train a supervised learning system. If we train a network to recognize photos of beagles, we want it to recognize any beagle picture, not just the specific ones it encountered in training. That’s easier said than done. Supervised learning systems, like people, have a tendency to memorize their training examples instead of generalizing from them. Back in Training vs. Testing, we called this problem overfitting.

To counter overfitting, we introduced the idea of a test set. Just like the teacher in our story, we train our neural networks on one set of examples and test them on a different set of examples. That’s how we get a realistic estimate of a network’s performance in production, where it will be faced with data that it’s never seen before.
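To make the split concrete, here is a minimal sketch of how such a split can be done with NumPy. This is our own illustration, not code from the course; the helper name `train_test_split` is our own. It shuffles the examples and carves off a held-out test set, much like setting aside 10 of the 60 quizzes:

```python
import numpy as np


def train_test_split(X, Y, test_size, seed=0):
    # Shuffle the example indices so the split is random:
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    # The first `test_size` shuffled indices become the test set:
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return X[train_idx], Y[train_idx], X[test_idx], Y[test_idx]


# 60 "quizzes": 50 go to training ("homework"), 10 to testing
X = np.arange(60).reshape(60, 1)
Y = np.arange(60).reshape(60, 1)
X_train, Y_train, X_test, Y_test = train_test_split(X, Y, test_size=10)
print(X_train.shape, X_test.shape)  # (50, 1) (10, 1)
```

The key property is that the two sets are disjoint: no example used for training ever appears in the test set.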

Because of overfitting, the network tends to be more accurate on familiar training data and less accurate on unfamiliar test data. Conversely, the network’s loss tends to be lower on the training data and higher on the test data.
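Overfitting in its purest form is easy to demonstrate. A 1-nearest-neighbor classifier literally memorizes its training examples, so it scores 100% on the data it trained on while doing worse on fresh data. The sketch below is our own toy example on synthetic data, not the course’s network, but it shows the same gap:

```python
import numpy as np

rng = np.random.default_rng(42)


def make_data(n):
    # Two noisy, overlapping classes of 2D points:
    X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(1, 1, (n, 2))])
    y = np.array([0] * n + [1] * n)
    return X, y


X_train, y_train = make_data(100)
X_test, y_test = make_data(100)


def predict_1nn(X_train, y_train, X):
    # For each point, copy the label of its closest training example:
    dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]


train_acc = np.mean(predict_1nn(X_train, y_train, X_train) == y_train)
test_acc = np.mean(predict_1nn(X_train, y_train, X_test) == y_test)
print(f"Training accuracy: {train_acc:.0%}, Test accuracy: {test_acc:.0%}")
```

Training accuracy comes out at 100% because every training point is its own nearest neighbor, while test accuracy is noticeably lower: pure memorization, zero generalization.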

Tweak our neural network code

Let’s put those expectations to the test. We build a version of our neural network that tracks the loss and accuracy on both the training and the test set, iteration by iteration. Then we run this hacked network for 10,000 iterations with batch GD (that is, putting all the examples in one large batch), n_hidden_nodes=200, and lr=0.01:

Note: Experiment with the given code widget by changing its hyperparameters to observe overfitting. We have set the number of iterations to 20 here to get results in a reasonable time. The better accuracy achieved by training the classifier for 10,000 iterations is shown below the code widget.

# The MNIST data loader that we used so far.
# It includes a training and a test set, but no validation set.

import numpy as np
import gzip
import struct


def load_images(filename):
    # Open and unzip the file of images:
    with gzip.open(filename, 'rb') as f:
        # Read the header information into a bunch of variables:
        _ignored, n_images, rows, columns = struct.unpack('>IIII', f.read(16))
        # Read all the pixels into a NumPy array of bytes:
        all_pixels = np.frombuffer(f.read(), dtype=np.uint8)
        # Reshape the pixels into a matrix where each line is an image:
        return all_pixels.reshape(n_images, columns * rows)


# 60000 images, each 784 elements (28 * 28 pixels)
X_train = load_images("/programming-machine-learning/data/mnist/train-images-idx3-ubyte.gz")

# 10000 images, each 784 elements, with the same structure as X_train
X_test = load_images("/programming-machine-learning/data/mnist/t10k-images-idx3-ubyte.gz")


def load_labels(filename):
    # Open and unzip the file of labels:
    with gzip.open(filename, 'rb') as f:
        # Skip the header bytes:
        f.read(8)
        # Read all the labels into a buffer of bytes:
        all_labels = f.read()
        # Reshape the labels into a one-column matrix:
        return np.frombuffer(all_labels, dtype=np.uint8).reshape(-1, 1)


def one_hot_encode(Y):
    n_labels = Y.shape[0]
    n_classes = 10
    encoded_Y = np.zeros((n_labels, n_classes))
    for i in range(n_labels):
        label = Y[i]
        encoded_Y[i][label] = 1
    return encoded_Y


# 60K labels, each a single digit from 0 to 9
Y_train_unencoded = load_labels("/programming-machine-learning/data/mnist/train-labels-idx1-ubyte.gz")

# 60K labels, each consisting of 10 one-hot encoded elements
Y_train = one_hot_encode(Y_train_unencoded)

# 10000 labels, each a single digit from 0 to 9
Y_test = load_labels("/programming-machine-learning/data/mnist/t10k-labels-idx1-ubyte.gz")
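The loader above only supplies the data. The per-iteration tracking described earlier can be sketched as follows. This is our own minimal stand-in, a single-layer softmax classifier on tiny synthetic data rather than the course’s two-layer network, but it logs training and test loss at every iteration in the same way:

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n, n_features=20, n_classes=10):
    # Tiny synthetic stand-in for MNIST-style data:
    X = rng.normal(size=(n, n_features))
    labels = rng.integers(0, n_classes, n)
    Y = np.eye(n_classes)[labels]  # one-hot encoding
    return X, Y


X_train, Y_train = make_data(100)
X_test, Y_test = make_data(50)


def softmax(logits):
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy_loss(X, Y, W):
    Y_hat = softmax(X @ W)
    return -np.mean(np.sum(Y * np.log(Y_hat + 1e-12), axis=1))


W = np.zeros((20, 10))
lr = 0.01
for i in range(100):
    # Track BOTH losses at every iteration, as described above:
    train_loss = cross_entropy_loss(X_train, Y_train, W)
    test_loss = cross_entropy_loss(X_test, Y_test, W)
    if i % 20 == 0:
        print(f"{i} > Training loss: {train_loss:.5f} - Test loss: {test_loss:.5f}")
    # Batch gradient descent step on the full training set:
    gradient = X_train.T @ (softmax(X_train @ W) - Y_train) / X_train.shape[0]
    W -= lr * gradient
```

Because the labels here are random noise, anything the model learns is pure memorization, so the training loss sinks below the test loss as the iterations pile up, which is exactly the divergence we want to observe.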

After training the model for 10,000 iterations, we get the following results:

0 > Training loss: 2.43321 - Test loss: 2.42661
1 > Training loss: 2.38746 - Test loss: 2.38024
...
9999 > Training loss: 0.14669 - Test loss: 0.22979
Training accuracy: 96.13%, Test accuracy: 93.25%

Loss curve comparison for test and training sets

This chart shows how the two losses change during training:

The losses start on even ground, but they soon diverge. As the network trains, its loss on the training set decreases faster than its loss on the test set. A lower loss generally means higher accuracy, and indeed, at the end of training, the network nails about 96% of the training examples, but only about 93% of the test examples. If we did not have a test set, we would harbor the illusion that our error rate is below 4%, when in reality it’s closer to 7%. That gap is the effect of overfitting: the network is more accurate on the training data simply because those are the data it trained on.

This principle is important enough to deserve a name. Here, we call it the Blind Test Rule: always test our system on data that it has not seen before. If we stick to this rule, we will not be disappointed by a neural network that’s less accurate on real-world data than it was on test data.

We might think that following the Blind Test Rule is easy, but unfortunately, it’s easy to violate by mistake. In fact, as we are about to find out, we already did.