Data Comes First

Explore the role of data in machine learning by studying the MNIST dataset, a collection of labeled handwritten digits used for supervised learning. Learn why keeping training and testing data separate lets you detect overfitting and check that your system generalizes to new images. This lesson helps you understand the fundamentals of image classification and prepares you for hands-on work with real datasets.

About data

It is time to leap from simple pizza-related examples to image recognition. We are about to do something magical. Within a few lessons, we’ll have a program that classifies images.

In this chapter and the next, we’ll apply our binary classifier to MNIST, a database of handwritten digits. A few years ago, before ML systems could tackle more complex datasets, AI researchers used MNIST as a standard benchmark for their algorithms. This will be our first experience with computer vision, so let’s start with small steps. We’ll begin by recognizing a single MNIST digit and leave more general character recognition to the next chapter.
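To make "recognizing a single MNIST digit" concrete, here is a minimal sketch of how ten-class digit labels could be turned into the 0/1 labels a binary classifier expects. The function name `encode_single_digit`, the choice of 5 as the target digit, and the sample labels are all hypothetical, chosen only for illustration. They are not taken from MNIST itself or from this course's code.

```python
def encode_single_digit(labels, target_digit):
    """Return 1 for each label equal to target_digit, and 0 otherwise."""
    return [1 if label == target_digit else 0 for label in labels]


# Hypothetical labels standing in for real MNIST labels (digits 0 to 9).
sample_labels = [5, 0, 4, 1, 9, 5, 3, 5]

# Assumption: we want to recognize the digit 5 and treat every other digit as "not 5".
binary_labels = encode_single_digit(sample_labels, target_digit=5)
print(binary_labels)  # [1, 0, 0, 0, 0, 1, 0, 1]
```

With labels in this form, the question "which digit is this?" becomes the yes/no question "is this a 5?", which is the kind of question a binary classifier can answer.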

Before we feed data to our ML system, let’s get up close and personal with that data. This section tells us all we need to know about MNIST.

Getting to know MNIST

MNIST is a collection of labeled images that has been assembled specifically for supervised learning. Its name stands for “Modified NIST” because it’s a remix of earlier data from the National Institute of Standards and Technology. ...