Data
Understand the importance of data in the machine learning pipeline.
The ML process
For any complex problem that requires a computer to identify patterns, there is an ML process for solving it.
This chapter demystifies that process one step at a time. This lesson covers the first step: data.
Data in the classification problem
The first step in both human learning and machine learning is looking at data. In previous lessons, we learned that data can be of different types. To classify galaxy images and to build the image tagger application, we needed images in our dataset. Similarly, the dataset for the music identifier app was a collection of sound files, and the language translation app used a list of sentences written in a particular language. In every case, the first step in solving a machine learning problem is to acquire data.
The data is where the patterns need to be identified.
All the data we have seen in the previous examples consists of input-output pairs. In a classification problem such as galaxy-type identification, each image (the input) is labeled with its corresponding class (the output). The same structure can be observed in the photo tagger, music identifier, and language translation applications.
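To make the idea of input-output pairs concrete, here is a minimal Python sketch of what such a dataset might look like for the galaxy example. The file names, the class labels, and the names Example and galaxy_dataset are hypothetical and chosen only for illustration. A real dataset would hold actual image data and many more examples.

```python
from typing import NamedTuple


class Example(NamedTuple):
    """One input-output pair: an input and the label attached to it."""
    input_file: str  # path to an image file (hypothetical names below)
    label: str       # the class assigned to that image


# Hypothetical file names and labels, for illustration only.
galaxy_dataset = [
    Example("galaxy_001.png", "spiral"),
    Example("galaxy_002.png", "elliptical"),
    Example("galaxy_003.png", "spiral"),
]

# Separate the pairs into a list of inputs and a list of outputs.
inputs = [example.input_file for example in galaxy_dataset]
labels = [example.label for example in galaxy_dataset]

# The distinct classes that appear in the dataset.
classes = sorted(set(labels))

for example in galaxy_dataset:
    print(f"{example.input_file} -> {example.label}")
```

Every input has exactly one label, so the two lists always have the same length. The other applications would follow the same shape, with a sound file or a sentence taking the place of the image.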
This structured data (input-output pair) serves as the foundation of a specific type of learning called ...