Regression with PySpark MLlib

Learn how to use the linear regression algorithm using PySpark MLlib.

Regression is a supervised learning method primarily used for predicting continuous variables. Let’s explore some of the most common regression algorithms that are available in PySpark MLlib.

Linear regression

Linear regression is a type of regression where the goal is to predict a continuous variable based on the features of the input samples. It is a straightforward yet powerful approach to supervised ML. In linear regression, a mapping function maps the predictor variables of a given sample to the response variable, or label (the dependent variable). The input variables are also called independent variables, explanatory variables, or features. The output of the mapping function is compared to the actual target value, and a loss is calculated. If the loss is high, the model is adjusted until the loss is minimized.

The mapping function has parameters that determine how input data maps to predicted values. Optimization adjusts these parameters to minimize the difference between the predicted output and the actual value; this difference is measured by a cost function, typically the mean squared error (MSE) for linear regression. Related metrics used to evaluate a fitted model include root mean square error (RMSE), mean absolute error (MAE), and R-squared.
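To make these metrics concrete, here is a minimal sketch in plain Python (the values below are hypothetical, chosen only for illustration) that computes MAE, MSE, RMSE, and R-squared by hand for a small set of predictions:

```python
import math

# Hypothetical actual values and model predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e * e for e in errors) / n    # mean squared error (a common cost function)
rmse = math.sqrt(mse)                   # root mean square error

# R-squared: fraction of variance in y_true explained by the model.
mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
ss_res = sum(e * e for e in errors)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, RMSE={rmse:.4f}, R2={r2}")  # → MAE=0.5, RMSE=0.6124, R2=0.925
```

In practice, PySpark MLlib computes these metrics for you via `pyspark.ml.evaluation.RegressionEvaluator`, but the underlying arithmetic is exactly this.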

Linear regression is used mainly for predictive analytics, such as spotting trends, estimating future values, and predicting the impact of planned changes. Some real-world examples of linear regression include:

  • Predicting crop yields based on environmental conditions such as rainfall.
  • Evaluating the impact of product prices on sales.
  • Forecasting future sales in upcoming months based on the company’s historical sales.
  • House price predictions based on features such as square footage, number of bedrooms, location, and so on.
