Introduction to Datasets

Explore methods to load diverse datasets in scalable model pipelines, using local files and distributed sources. Understand how to automate data retrieval with Python libraries, and prepare dataframes for production workflows.

We'll cover the following...

Distributed data sources
Common datasets

Load data from a library
Load data from web

Try it out!

Distributed data sources

To build scalable data pipelines, we’ll need to switch from local files, such as CSVs, to distributed data sources, such as Parquet files on S3. The tools for loading data vary significantly across cloud platforms, but the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments use other implementations, such as Spark dataframes in PySpark.
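As a minimal sketch of the single-machine case, the helper below returns a Pandas dataframe from either a CSV or a Parquet file. The function name `load_dataframe` and the example paths are hypothetical. Reading Parquet assumes an engine such as pyarrow is installed, and reading an `s3://` path additionally assumes s3fs is available.

```python
import pandas as pd


def load_dataframe(path: str) -> pd.DataFrame:
    """Return a Pandas dataframe for a CSV or Parquet source (hypothetical helper)."""
    if path.endswith(".parquet"):
        # Needs a Parquet engine such as pyarrow; s3:// paths also need s3fs.
        return pd.read_parquet(path)
    if path.endswith(".csv"):
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {path}")
```

With this in place, a call such as `load_dataframe("games.csv")` or `load_dataframe("s3://my-bucket/games.parquet")` (both placeholder names) produces the same kind of object, so downstream code does not need to know where the data came from.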

In this lesson, we will introduce the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data on a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and read into Pandas using read_csv, it’s good practice to develop automated workflows that connect to diverse data sources.
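One way to automate retrieval, rather than downloading a CSV by hand, is to fetch the file once and reuse the local copy on later runs. The sketch below is illustrative only: the function name `fetch_csv` and the default `data` cache directory are assumptions, and it expects a URL whose last path segment is a usable file name.

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd


def fetch_csv(url: str, cache_dir: str = "data") -> pd.DataFrame:
    """Download a CSV once, then reuse the local copy (hypothetical helper)."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Use the last URL path segment as the local file name.
    local_path = cache / url.rstrip("/").split("/")[-1]
    if not local_path.exists():
        # Only hit the remote source when no cached copy exists.
        urlretrieve(url, str(local_path))
    return pd.read_csv(local_path)
```

Note that read_csv also accepts a URL directly; the cached variant simply avoids repeating the download each time the pipeline runs.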

Common datasets

We’ll explore the following datasets throughout this course:

Boston Housing: median home values for census tracts in the Boston area, derived from 1970 census data
Game Purchases: a synthetic dataset representing games purchased by different users on Xbox One
Natality: one of BigQuery’s open datasets on birth statistics in the US over multiple decades
Kaggle NHL: play-by-play events from professional hockey games and game statistics over the past decade

The first two datasets can each be loaded with a single command, as long as the required libraries are installed.

The Natality and Kaggle NHL datasets require setting up authentication files before you programmatically pull the data sources into Pandas.
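Before pulling these two datasets, it can help to confirm that the authentication files are where the client tools will look for them. The sketch below assumes the default locations: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable for Google Cloud client libraries, and `~/.kaggle/kaggle.json` for the Kaggle API token. The function name `missing_credentials` and its parameters are hypothetical, and your setup may use different locations.

```python
import os
from pathlib import Path


def missing_credentials(env=None, home=None) -> list:
    """List the data sources whose credential files were not found.

    Hypothetical helper; `env` and `home` default to the real
    environment and home directory and exist mainly for testing.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    missing = []

    # Assumed default: Google Cloud clients read the key file path from this variable.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_path or not Path(key_path).is_file():
        missing.append("BigQuery (GOOGLE_APPLICATION_CREDENTIALS)")

    # Assumed default: the Kaggle API token lives in ~/.kaggle/kaggle.json.
    if not (home / ".kaggle" / "kaggle.json").is_file():
        missing.append("Kaggle (~/.kaggle/kaggle.json)")

    return missing
```

Calling `missing_credentials()` at the start of a workflow gives an early, readable failure instead of an authentication error partway through a data pull.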
