Introduction to Datasets

An introduction to datasets and different ways to load them.

Distributed data sources

To build scalable data pipelines, we’ll need to switch from using local files, such as CSVs, to distributed data sources, such as Parquet files on S3. While the tools used to load data vary significantly across cloud platforms, the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments rely on other implementations, such as Spark dataframes in PySpark.
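
As a brief illustration of this difference, here is a hedged sketch; the S3 path is a placeholder, and the snippet assumes PySpark and a Parquet-capable Pandas installation (pyarrow and s3fs) are available:

# Single machine: Pandas reads the file into local memory
import pandas as pd
pandasDF = pd.read_parquet("s3://my-bucket/events.parquet")

# Distributed: Spark partitions the same data across a cluster
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("intro").getOrCreate()
sparkDF = spark.read.parquet("s3://my-bucket/events.parquet")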

In this lesson, we will introduce the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data using a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and read into Pandas using read_csv, it’s good practice to develop automated workflows to connect to diverse data sources.
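
For comparison, the manual baseline that the rest of this lesson improves on is a single read from disk, shown below with a hypothetical local filename:

import pandas as pd

# Manual baseline: the CSV was downloaded by hand ahead of time
localDF = pd.read_csv("boston.csv")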

Common datasets

We’ll explore the following datasets throughout this course:

  • Boston Housing: records of sale prices of homes in the Boston housing market back in 1980
  • Game Purchases: a synthetic dataset representing games purchased by different users on Xbox One
  • Natality: one of BigQuery’s open datasets on birth statistics in the US over multiple decades
  • Kaggle NHL: play-by-play events from professional hockey games and game statistics over the past decade

The first two datasets can be loaded with just a few lines of code, as long as you have the required libraries installed.

The Natality and Kaggle NHL datasets require setting up authentication files before you can programmatically pull the data into Pandas.
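
As a rough sketch of what that setup typically looks like (the file paths below are placeholders, and later chapters walk through generating these credential files):

import os

# BigQuery clients look for a service account key file via this
# environment variable (the path below is a placeholder)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account.json"

# The Kaggle API expects an API token saved to ~/.kaggle/kaggle.json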

Load data from a library

The first approach we’ll use to load a dataset is retrieving it directly from a library. Multiple libraries have included the Boston housing dataset because it is a small dataset that is useful for testing out regression models. It used to ship with scikit-learn as load_boston, but that loader was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet below instead follows the workaround suggested in scikit-learn’s deprecation notice: fetching the raw data from the original CMU StatLib source. First, we install the required libraries by running pip from the command line:

In our pre-configured execution environment below, these libraries are already installed.

pip3 install pandas==1.3.5
pip3 install numpy

Once the libraries are installed, we can switch back to the Jupyter notebook to explore the dataset. The code snippet below loads the Pandas and NumPy libraries, fetches the raw Boston data, reassembles the two-line records into a single dataframe, and displays the first 5 records:

import pandas as pd
import numpy as np

# Each record in the raw StatLib file is split across two lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Even rows hold the first 11 features; odd rows hold the last 2
# features followed by the target value (median home price)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS',
    'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
bostonDF['label'] = target
bostonDF.head()

The result of running these commands is shown in the figure below:

Boston Housing dataset
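
As a quick sanity check, not part of the original snippet, we can confirm the dataframe’s dimensions; the Boston housing dataset contains 506 records with 13 features:

bostonDF.shape  # (506, 14): 506 records, 13 features plus the label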

Load data from the web

The second approach we’ll use to load a dataset is fetching it from the web. The CSV for the Game Purchases dataset is available as a single file on GitHub. We can fetch it into a Pandas dataframe by using the read_csv function and passing the URL of the file as a parameter:

import pandas as pd

# Pandas can read directly from a URL; the file is fetched on the fly
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
gamesDF.head()
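
One optional addition, not part of the original lesson: once the file has been fetched, we can cache a local copy so that repeated runs don’t depend on the network:

# Cache a local copy to avoid re-downloading on every run
gamesDF.to_csv("games-expand.csv", index=False)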

Both of these approaches resemble downloading CSV files and reading them from a local directory, but they avoid the manual download step. Eliminating manual steps like this is what makes it possible to build fully automated workflows in Python.

The result of reading the dataset and printing out the first few records is shown in the figure below:

Game Purchases dataset

Try it out!

Let’s try these commands out in a Jupyter notebook.
