Classification with PySpark MLlib

Learn how to apply logistic regression to classification problems with PySpark MLlib.

In machine learning, classification models play a crucial role when the task is to predict distinct categories or classes.

Logistic regression

One of the most commonly used algorithms for classification is logistic regression. It predicts the probability of each possible outcome, which makes it a good fit for scenarios where we want to estimate how likely an event is to occur. In PySpark MLlib, logistic regression can be used for binary classification (two distinct classes) using binomial logistic regression or for multiclass classification (more than two classes) using multinomial logistic regression. The variant is selected with the family parameter, which accepts “auto” (the default), “binomial,” or “multinomial.” With “auto,” Spark infers the variant from the number of classes in the data: binomial for one or two classes and multinomial otherwise.

Multinomial logistic regression can also be used for binary classification by setting the family parameter to “multinomial.” The fitted model then contains two sets of coefficients and two intercepts, one per class.
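To see why a two-class multinomial model carries two sets of coefficients, it can help to compare the softmax over two class scores with the single sigmoid used by the binomial form. The sketch below uses plain Python rather than Spark, and every number in it (coefficients, intercepts, and the feature vector) is hypothetical. It only illustrates the arithmetic and makes no claim about what Spark would fit on real data.

```python
import math


def sigmoid(z):
    """Logistic function used by binomial logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Softmax used by multinomial logistic regression (max-shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical two-class multinomial parameters:
# one coefficient row and one intercept per class.
coef_rows = [[-0.8, 0.3], [0.8, -0.3]]
intercepts = [0.1, -0.1]
x = [1.5, 0.4]  # hypothetical feature vector

# Multinomial view: one score per class, normalized with softmax.
scores = [dot(w, x) + b for w, b in zip(coef_rows, intercepts)]
p_multinomial = softmax(scores)[1]

# Binomial view: a single coefficient vector and intercept, obtained as
# the difference between the class-1 and class-0 parameters.
w_bin = [w1 - w0 for w0, w1 in zip(coef_rows[0], coef_rows[1])]
b_bin = intercepts[1] - intercepts[0]
p_binomial = sigmoid(dot(w_bin, x) + b_bin)

print(p_multinomial, p_binomial)
```

Both expressions give the same probability for class 1, so the two parameter sets of the multinomial form describe the same decision rule as the single set of the binomial form.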

Note: When a logistic regression model is fitted without an intercept on a dataset that contains a constant nonzero column, PySpark MLlib outputs a zero coefficient for that column.
