Unsupervised Learning with PySpark MLlib
Explore unsupervised learning with PySpark MLlib, focusing on K-means clustering. Learn to preprocess text data, create clusters, choose the optimal number of clusters, and evaluate results using the silhouette score.
In addition to supervised learning algorithms like regression and classification that we explored in previous lessons, PySpark’s MLlib offers robust support for unsupervised learning algorithms. Unsupervised learning is particularly valuable when dealing with unlabeled data because it allows us to discover hidden patterns, structures, or groupings within the data. In this lesson, we’ll delve into one of the most widely used unsupervised learning methods: K-means clustering.
Introduction to K-means clustering
K-means clustering is a powerful unsupervised learning technique designed to uncover underlying patterns within data by grouping similar samples together ...
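To make the idea of grouping similar samples concrete, here is a minimal plain-Python sketch of the standard K-means loop: assign each point to its nearest center, then move each center to the mean of its points. It is an illustration only and not how MLlib implements the algorithm; the `kmeans` helper, the toy points, and the hand-picked starting centers are all assumptions made for this example.

```python
from math import dist


def kmeans(points, centers, iterations=10):
    """Plain-Python K-means sketch (hypothetical helper, not an MLlib API)."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break  # centers stopped moving, so the assignments are stable
        centers = new_centers
    return centers, labels


# Toy data (assumed for illustration): two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Starting centers are picked by hand here to keep the run deterministic.
centers, labels = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

On this toy data, the first three points end up in one cluster and the last three in the other. In practice the starting centers are usually chosen randomly or with a seeding strategy rather than by hand, which is why results can vary between runs unless a seed is fixed.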