Apriori Algorithm for Finding Frequent Itemsets with PySpark

Let’s say we run a grocery store and have a large volume of point-of-sale data. We want items that are frequently bought together to be placed on nearby shelves, both to boost sales and to make shopping more convenient. To find these sets of items, we can use the Apriori algorithm. It’s much faster than the brute-force approach of counting every possible combination of items, because it only counts an itemset once all of its smaller subsets are known to be frequent. It can also be implemented in a distributed computing scenario.
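To make the idea concrete, here is a minimal single-machine sketch of Apriori in plain Python. It is not the project's code: the function name `apriori`, the `min_support` parameter (an absolute transaction count), and the sample baskets are assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    `min_support` transactions. Names here are illustrative assumptions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: drop any candidate with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count only the surviving candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical sample data: four shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]
result = apriori(baskets, min_support=2)
```

With these baskets, the triple bread, butter, milk is never counted against the data: its subset butter and milk appears in only one basket, so the prune step discards it first. This is where the savings over brute force come from.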

We’ll first write the Python code that processes dataset partitions in parallel on the worker nodes. We’ll then write the final, centralized itemset frequency check performed by the master node. The resulting code can be run on a compute cluster to experience distributed computing in full.
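The two-step flow described above can be sketched without a cluster by treating each partition as a plain Python list. This is only an illustration under stated assumptions: the function names `local_candidates` and `global_check`, the `support_fraction` parameter, and the sample partitions are hypothetical. For brevity, the worker step counts all itemsets up to `max_size` directly rather than running full Apriori. In PySpark, the worker step would typically be applied per partition with something like `RDD.mapPartitions`.

```python
from itertools import combinations
from math import ceil


def local_candidates(partition, support_fraction, max_size=2):
    """Worker step (illustrative): itemsets that are frequent within this
    partition alone, using the threshold scaled to the partition size."""
    baskets = [frozenset(b) for b in partition]
    threshold = ceil(support_fraction * len(baskets))
    counts = {}
    for basket in baskets:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(basket), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s for s, c in counts.items() if c >= threshold}


def global_check(partitions, candidates, support_fraction):
    """Master step (illustrative): recount every candidate over the whole
    dataset and keep only those that meet the global threshold."""
    total = sum(len(p) for p in partitions)
    threshold = ceil(support_fraction * total)
    counts = {c: 0 for c in candidates}
    for partition in partitions:
        for basket in partition:
            basket = frozenset(basket)
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
    return {c: n for c, n in counts.items() if n >= threshold}


# Hypothetical data: two partitions of two baskets each.
partitions = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["bread", "butter"], ["milk", "eggs"]],
]
support_fraction = 0.5

# Union of the locally frequent itemsets from all partitions.
candidates = set().union(
    *(local_candidates(p, support_fraction) for p in partitions)
)
frequent = global_check(partitions, candidates, support_fraction)
```

The central check is needed because an itemset can be frequent inside one partition yet rare overall. The reverse cannot happen: an itemset that meets the support fraction over the whole dataset must meet it in at least one partition, so the union of local results misses nothing up to `max_size`.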

