Introduction to Datasets

An introduction to datasets and different ways to load them.

Distributed data sources

To build scalable data pipelines, we’ll need to switch from local files, such as CSVs, to distributed data sources, such as Parquet files on S3. The tools used to load data vary significantly across cloud platforms, but the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments use different implementations, such as Spark dataframes in PySpark.
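
As a rough illustration of the single-machine case, the sketch below loads a small CSV into a Pandas dataframe. The column names and values are made up for the example, and the S3 bucket and key in the commented-out line are placeholders, not real locations. Reading Parquet from S3 with Pandas is assumed to require the optional `pyarrow` and `s3fs` packages plus configured credentials.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a local CSV file.
csv_text = "user_id,score\n1,0.5\n2,0.9\n3,0.7\n"

# Single-machine load: Pandas reads the whole file into memory.
local_df = pd.read_csv(io.StringIO(csv_text))
print(local_df.shape)

# A Parquet file on S3 can be loaded with a similar call, assuming the
# optional pyarrow and s3fs packages are installed and credentials are
# configured. The bucket and key below are placeholders.
# remote_df = pd.read_parquet("s3://example-bucket/path/to/data.parquet")
```

Either way the result is a dataframe, so downstream code that only relies on the dataframe interface does not need to know where the data came from.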

This lesson introduces the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data on a single machine, while later chapters will present distributed approaches. ...