Regression with PySpark MLlib

Learn how to use the linear regression algorithm using PySpark MLlib.

Regression is a supervised learning method primarily used for predicting continuous variables. Let’s explore some of the most common regression algorithms that are available in PySpark MLlib.

Linear regression

Linear regression is a type of regression where the goal is to predict a continuous variable based on the features of the input samples. It is a straightforward yet powerful approach to supervised ML. In linear regression, a mapping function maps the predictor variables of a given sample to the response variable, or label (the dependent variable). The input variables are also called independent variables, explanatory variables, or features. The output of the mapping function is compared to the actual target value, and a loss is calculated. If the loss is high, the model is adjusted until the loss is minimized.

The mapping function has parameters that determine how input data maps to predicted values. Optimization adjusts these parameters to minimize the difference between the predicted output and the actual value; this difference is measured by a cost function, typically the mean squared error (MSE) for linear regression. Related metrics used to evaluate a fitted model include root mean square error (RMSE), mean absolute error (MAE), and R-squared.
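To make these metrics concrete, here is a minimal sketch in plain Python (the values below are hypothetical, chosen only for illustration) that computes MAE, MSE, RMSE, and R-squared by hand for a small set of predictions:

```python
import math

# Hypothetical actual values and model predictions, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e * e for e in errors) / n    # mean squared error (a common cost function)
rmse = math.sqrt(mse)                   # root mean square error

# R-squared: fraction of variance in y_true explained by the model.
mean_y = sum(y_true) / n
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
ss_res = sum(e * e for e in errors)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae}, RMSE={rmse:.4f}, R2={r2}")  # → MAE=0.5, RMSE=0.6124, R2=0.925
```

In practice, PySpark MLlib computes these metrics for you via `pyspark.ml.evaluation.RegressionEvaluator`, but the underlying arithmetic is exactly this.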

Linear regression is used mainly for predictive analytics, such as spotting trends, estimating future values, and predicting the impact of planned changes. Some real-world examples of linear regression include:

  • Predicting crop yields based on environmental conditions such as rainfall.
  • Evaluating the impact of product prices on sales.
  • Forecasting future sales in upcoming months based on the company’s historical sales.
  • House price predictions based on features such as square footage, number of bedrooms, location, and so on.
