Introduction to Datasets

An introduction to datasets and different ways to load them.

Distributed data sources

To build scalable data pipelines, we’ll need to switch from using local files, such as CSVs, to distributed data sources, such as Parquet files on S3. While the tools used to load data vary significantly across cloud platforms, the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments rely on other implementations, such as Spark dataframes in PySpark.
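
As a brief illustration of this difference, here is a hedged sketch; the S3 path is a placeholder, and the snippet assumes PySpark and a Parquet-capable Pandas installation (pyarrow and s3fs) are available:

# Single machine: Pandas reads the file into local memory
import pandas as pd
pandasDF = pd.read_parquet("s3://my-bucket/events.parquet")

# Distributed: Spark partitions the same data across a cluster
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("intro").getOrCreate()
sparkDF = spark.read.parquet("s3://my-bucket/events.parquet")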

In this lesson, we will introduce the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data using a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and read into Pandas using read_csv, it’s good practice to develop automated workflows to connect to diverse data sources.
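
For comparison, the manual baseline that the rest of this lesson improves on is a single read from disk, shown below with a hypothetical local filename:

import pandas as pd

# Manual baseline: the CSV was downloaded by hand ahead of time
localDF = pd.read_csv("boston.csv")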

Common datasets

We’ll explore the following datasets throughout this course:

  • Boston Housing: records of sale prices of homes in the Boston housing market back in 1980
  • Game Purchases: a synthetic dataset representing games purchased by different users on Xbox One
  • Natality: one of BigQuery’s open datasets on birth statistics in the US over multiple decades
  • Kaggle NHL: play-by-play events from professional hockey games and game statistics over the past decade

The first two datasets can be loaded with just a few lines of code, as long as you have the required libraries installed.

The Natality and Kaggle NHL datasets require setting up authentication files before you can programmatically pull the data into Pandas.
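
As a rough sketch of what that setup typically looks like (the file paths below are placeholders, and later chapters walk through generating these credential files):

import os

# BigQuery clients look for a service account key file via this
# environment variable (the path below is a placeholder)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"

# The Kaggle API expects an API token saved to ~/.kaggle/kaggle.json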

Load data from a library

The first approach we’ll use to load a dataset is retrieving it directly from a library. Multiple libraries have included the Boston housing dataset because it is a small dataset that is useful for testing out regression models. It used to ship with scikit-learn as load_boston, but that loader was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet below instead follows the workaround suggested in scikit-learn’s deprecation notice: fetching the raw data from the original CMU StatLib source. First, we install the required libraries by running pip from the command line:

In our pre-configured execution environment below, these libraries are already installed.

pip3 install pandas==1.3.5
pip3 install numpy

Once the libraries are installed, we can switch back to the Jupyter notebook to explore the dataset. The code snippet below loads the Pandas and NumPy libraries, fetches the raw Boston data, reassembles the two-line records into a single dataframe, and displays the first 5 records:

import pandas as pd
import numpy as np

# Each record in the raw StatLib file is split across two lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Even rows hold the first 11 features; odd rows hold the last 2
# features followed by the target value (median home price)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS',
    'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target
bostonDF.head()

The result of running these commands is shown in the figure below:

Boston Housing dataset
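
As a quick sanity check, not part of the original snippet, we can confirm the dataframe’s dimensions; the Boston housing dataset contains 506 records with 13 features:

bostonDF.shape  # (506, 14): 506 records, 13 features plus the label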

Load data from the web

The second approach we’ll use to load a dataset is fetching it from the web. The CSV for the Game Purchases dataset is available as a single file on GitHub. We can fetch it into a Pandas dataframe by using the read_csv function and passing the URL of the file as a parameter:

import pandas as pd

# Pandas can read directly from a URL; the file is fetched on the fly
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
gamesDF.head()
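
One optional addition, not part of the original lesson: once the file has been fetched, we can cache a local copy so that repeated runs don’t depend on the network:

# Cache a local copy to avoid re-downloading on every run
gamesDF.to_csv("games-expand.csv", index=False)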

Both of these approaches resemble downloading CSV files and reading them from a local directory, but they avoid the manual download step. Eliminating manual steps like this is what makes it possible to build fully automated workflows in Python.

The result of reading the dataset and printing out the first few records is shown in the figure below:

Game Purchases dataset

Try it out!

Let’s try these commands out in a Jupyter notebook.
