Unsupervised Learning with PySpark MLlib

Explore unsupervised learning with PySpark MLlib focusing on K-means clustering. Learn to preprocess text data, create clusters, choose optimal cluster numbers, and evaluate results using the Silhouette score.

We'll cover the following...

Introduction to K-means clustering
K-means clustering with PySpark MLlib

In addition to supervised learning algorithms like regression and classification that we explored in previous lessons, PySpark’s MLlib offers robust support for unsupervised learning algorithms. Unsupervised learning is particularly valuable when dealing with unlabeled data because it allows us to discover hidden patterns, structures, or groupings within the data. In this lesson, we’ll delve into one of the most widely used unsupervised learning methods: K-means clustering.

Introduction to K-means clustering

K-means clustering is a powerful unsupervised learning technique designed to uncover underlying patterns within data by grouping similar samples together ...
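To make the idea of "grouping similar samples together" concrete, here is a minimal sketch of the K-means loop in plain Python. It is not MLlib's implementation and does not use PySpark. The function name `kmeans`, the toy `points` list, and the choice to seed the centroids with the first `k` points are all assumptions made for illustration only.

```python
import math

def kmeans(points, k, max_iter=100):
    """Illustrative K-means (Lloyd's algorithm); returns (centroids, labels)."""
    # Assumption: seed centroids with the first k points so the run is
    # deterministic. Real libraries typically use smarter initialization.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two visually obvious groups in 2D.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The loop alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its cluster, and it stops when the centroids stop changing. Points that sit close together end up with the same label, which is the grouping behavior described above.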
