Learn about different cloud environments and tools for building scalable data and model pipelines.

In this module, we'll cover different cloud environments and tools for building scalable data and model pipelines. We'll look at the datasets and types of models that are used heavily in everyday production work. Throughout the module, exercises and challenges will help us get comfortable with the diverse toolset. Lastly, we'll explore streaming model workflows, which are crucial for building real-time data pipelines.
By the end of this module, we'll have a better understanding of how to build scalable machine learning pipelines in a cloud environment.

Data Science in Production: Building Scalable Model Pipelines

## Distributed data sources
To build scalable data pipelines, we'll need to switch from using local files, such as CSVs, to distributed data sources, such as Parquet files on S3. While the tools used to load data vary significantly across cloud platforms, the end result is usually the same: a dataframe. In a single-machine environment, we can use Pandas to load the dataframe, while distributed environments use different implementations, such as Spark dataframes in PySpark.

In this lesson, we'll introduce the datasets that we'll explore throughout the course.
In this chapter, we'll focus on loading the data on a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and read into Pandas using read_csv, it's good practice to develop automated workflows that connect to diverse data sources.
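
As a rough illustration of the single-machine case, the sketch below wraps Pandas loading in one small helper so the rest of a workflow only ever sees a dataframe. The helper name `load_dataframe`, its `file_format` parameter, and the sample columns are hypothetical and are not taken from the course datasets. An in-memory string stands in for a downloaded CSV file so that the example runs without any external files.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """user_id,sessions,purchased
1,3,0
2,7,1
3,1,0
"""


def load_dataframe(source, file_format="csv"):
    """Load a dataset into a Pandas dataframe.

    `source` can be a local path, a URL, or a file-like object.
    """
    if file_format == "csv":
        return pd.read_csv(source)
    if file_format == "parquet":
        # Needs an optional Parquet engine such as pyarrow or fastparquet.
        return pd.read_parquet(source)
    raise ValueError(f"Unsupported format: {file_format}")


# Read the in-memory CSV exactly as we would read a file on disk.
df = load_dataframe(io.StringIO(SAMPLE_CSV))
print(df.head())
```

Keeping the loading step behind one function is a design choice rather than a requirement. It means that moving from a local CSV to another source later should only require changing the arguments, not the code that consumes the dataframe.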




An introduction to datasets and different ways to load them. 

Introduction to Building Scalable Model Pipelines

Models as Web Endpoints

Models as Serverless Functions

Containers for Reproducible Models

Workflow Tools for Model Pipelines

PySpark for Batch Pipelines

Cloud Dataflow for Batch Modeling

Streaming Model Workflows

Conclusion

Introduction to Datasets

Distributed data sources