Datasets

Explore how to structure dataset creation within a machine learning pipeline by implementing abstract base classes and mixins in Python. Understand loading, preprocessing, feature engineering, and saving data, using the Iris dataset as an example to build a practical ML pipeline segment.

Let’s first review the architecture of the pipeline.

ML pipeline architecture

Let’s also revisit the diagram depicting encapsulation.

Diagram of the Dataset class

From these diagrams, we see that the first step in the development of the pipeline is to create an abstract base class called Dataset from which all other dataset classes derive. This base class should have four abstract methods—load, preprocess, feature_engineer, and save—corresponding to the blocks shown above. Note that each of these methods, with the exception of save, also corresponds to a task in the pipeline. The save method is a utility method that we’ll examine later.
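As a rough illustration of the base class described above, here is a minimal sketch using Python's standard `abc` module. The class name `Dataset` and the four method names come from the text. The no-argument signatures, the `ToyDataset` subclass, and its `rows` and `saved` attributes are assumptions made only for this example and are not the lesson's actual implementation.

```python
from abc import ABC, abstractmethod


class Dataset(ABC):
    """Abstract base class from which concrete dataset classes derive."""

    @abstractmethod
    def load(self):
        """Read the raw data from its source."""

    @abstractmethod
    def preprocess(self):
        """Clean the loaded data."""

    @abstractmethod
    def feature_engineer(self):
        """Derive additional features from the cleaned data."""

    @abstractmethod
    def save(self):
        """Utility method: persist the current state of the data."""


class ToyDataset(Dataset):
    """Hypothetical in-memory subclass, used only to show the contract."""

    def __init__(self):
        self.rows = []     # assumed attribute holding the working data
        self.saved = None  # assumed stand-in for a file or database

    def load(self):
        # Hard-coded rows stand in for reading from a real source.
        self.rows = [[5.1, 3.5], [None, 3.0], [6.2, 2.9]]

    def preprocess(self):
        # Drop any row that contains a missing value.
        self.rows = [r for r in self.rows if None not in r]

    def feature_engineer(self):
        # Append the ratio of the two columns as a new feature.
        self.rows = [r + [r[0] / r[1]] for r in self.rows]

    def save(self):
        # Copy the rows into the stand-in store.
        self.saved = [list(r) for r in self.rows]
```

Because every method is marked with `@abstractmethod`, calling `Dataset()` directly raises a `TypeError`, and a subclass can be instantiated only once it overrides all four methods. This is how the base class enforces the structure shown in the diagrams.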

The ML project we’re working on is iris classification, so we need a dataset for ...