Clustering with PyCaret

Learn how to import necessary libraries and generate datasets for clustering with PyCaret.

One of the fundamental tasks in unsupervised machine learning is clustering. This task aims to group the instances of a dataset into clusters based on their shared characteristics. Clustering has many practical applications in fields such as market research, social network analysis, bioinformatics, and medicine. The k-means method is a simple and widely used clustering method. It is defined by the following optimization problem:

$$\min_{C_{1}, \ldots, C_{K}}\left\{\sum_{k=1}^{K} W(C_{k})\right\}$$

$K$ is the number of clusters, while $C_{k}$ represents the $k$-th cluster. Our goal is to choose the clusters so that the sum of $W(C_{k})$, the measure of within-cluster variation, is as small as possible.

$$W\left(C_{k}\right)=\frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}$$
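To make the definition of $W(C_{k})$ concrete, here is a minimal NumPy sketch that evaluates it for a single cluster. The function name `within_cluster_variation` and the three sample points are hypothetical and chosen only for illustration. The sketch also assumes the double sum runs over all ordered pairs $(i, i^{\prime})$ of observations in the cluster.

```python
import numpy as np

def within_cluster_variation(X):
    """Evaluate W(C_k) for one cluster; X has shape (n_points, p)."""
    X = np.asarray(X, dtype=float)
    # Differences x_i - x_i' for every ordered pair (i, i') via broadcasting.
    diffs = X[:, None, :] - X[None, :, :]
    # Sum the squared differences over all pairs and features,
    # then divide by the cluster size |C_k|.
    return (diffs ** 2).sum() / len(X)

# Hypothetical cluster of three two-dimensional points.
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
w = within_cluster_variation(cluster)
print(w)
```

Under the same assumption, this value equals twice the sum of squared distances from each point to the cluster mean, which is why k-means implementations can work with centroids instead of all pairs.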

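There are various ways to define within-cluster variation, but the most common is the squared Euclidean distance, as used in the definition of $W(C_{k})$ above. There, $|C_{k}|$ is the number of observations in cluster $k$, $p$ is the number of features, and $x_{ij}$ is the value of feature $j$ for observation $i$. Substituting this definition gives the following form of $k$-means clustering: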

$$\min_{C_{1}, \ldots, C_{K}} \left\{ \sum_{k=1}^{K} \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2} \right\}$$
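The objective above can be evaluated for any candidate assignment of points to clusters. The following sketch compares two assignments of the same four points. The helper `kmeans_objective` and the toy data are hypothetical, and the code only scores a given partition. It does not search for the minimizing one, which is what a k-means algorithm would do.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of W(C_k) over all clusters defined by the integer labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        # Squared Euclidean differences over all ordered pairs in cluster k.
        diffs = members[:, None, :] - members[None, :, :]
        total += (diffs ** 2).sum() / len(members)
    return total

# Hypothetical data: two tight groups that lie far apart.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

good = kmeans_objective(points, [0, 0, 1, 1])  # nearby points grouped together
bad = kmeans_objective(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

The assignment that groups nearby points yields the smaller objective value, which is the behavior the minimization is meant to reward.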
