MLlib Batch Pipeline

Learn about machine learning libraries in PySpark to build predictive models.

Now that we’ve covered loading and transforming data with PySpark, we can use the machine learning libraries in PySpark to build a predictive model.

MLlib

The core library for building predictive models in PySpark is MLlib, which provides a suite of supervised and unsupervised algorithms.

While MLlib does not cover every algorithm available in scikit-learn (sklearn), it supports most of the operations needed in data science workflows. In this section, we’ll show you how to apply MLlib to a classification problem and save the model’s predictions to a data lake.
