
Reproducibility

Explore how to maintain reproducibility in machine learning pipelines by understanding sources of randomness and variation, such as input data, preprocessing, data splitting, model selection, and environment versions. Learn practical techniques like data versioning, setting random seeds, configuration control, and environment management to produce consistent model results across runs.

Reproducibility is of paramount importance in science, and that’s also true when it comes to data science. A model trained on a given dataset a second time, with exactly the same preprocessing and feature engineering steps and hyperparameters, should perform almost—if not exactly—the same as the first model.

Traditional software programs are deterministic and, in general, will always produce the same output for a fixed input. But ML systems are stochastic in nature, so this isn’t the case, and it takes some effort to achieve ...
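One of the simplest techniques mentioned above, setting random seeds, can be sketched as follows. This is a minimal illustration using only Python's standard library (the function name `sample_split` is a hypothetical example, not part of any particular framework): fixing the seed makes an otherwise random operation, like shuffling indices for a train/test split, return the same result on every run.

```python
import random

def sample_split(n_items, seed=42):
    """Deterministically shuffle indices for a train/test split."""
    # A local RNG instance avoids mutating global random state,
    # which other code in the pipeline might depend on.
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    return indices

# Same seed, same shuffle, on every run and every machine.
run_a = sample_split(10)
run_b = sample_split(10)
assert run_a == run_b
```

In a real pipeline you would apply the same idea to every source of randomness at once: NumPy, the ML framework's own RNG, and any GPU-level nondeterminism, typically from a single seed recorded in the run's configuration.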

...