# Clustering with PyCaret

Learn how to import necessary libraries and generate datasets for clustering with PyCaret.


One of the fundamental tasks in unsupervised machine learning is **clustering**. This task aims to group the instances of a given dataset into clusters based on their common characteristics. Clustering has many practical applications in fields such as market research, social network analysis, bioinformatics, and medicine. One simple and widely used clustering method is $k$-means. It is defined by the following optimization problem:

$\min_{C_{1}, \ldots, C_{K}}\left\{\sum_{k=1}^{K} W(C_{k})\right\}$

$K$ is the number of clusters, and $C_{k}$ denotes the $k$-th cluster. Our goal is to choose the clusters so that the sum of $W(C_{k})$ over all clusters is as small as possible, where $W$ is a measure of within-cluster variation.

$W\left(C_{k}\right)=\frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2}$

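There are various ways to define within-cluster variation, but the most common uses the **squared Euclidean distance**, as in the equation above. Here, $\left|C_{k}\right|$ is the number of observations in the $k$-th cluster and $p$ is the number of features. Substituting this definition into the objective gives the following form of $k$-means clustering: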

$\min_{C_{1}, \ldots, C_{K}} \left\{ \sum_{k=1}^{K} \frac{1}{\left|C_{k}\right|} \sum_{i, i^{\prime} \in C_{k}} \sum_{j=1}^{p}\left(x_{i j}-x_{i^{\prime} j}\right)^{2} \right\}$
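To make the objective concrete, here is a minimal sketch in plain Python that evaluates it for a fixed assignment of points to clusters. It does not use PyCaret and does not search for the best clustering. It only computes the quantity being minimized. The function names `within_cluster_variation` and `kmeans_objective`, as well as the toy data, are hypothetical and chosen for illustration.

```python
from itertools import product


def within_cluster_variation(cluster):
    """W(C_k): squared Euclidean distances summed over all ordered
    pairs (i, i') in the cluster, divided by the cluster size |C_k|."""
    if not cluster:
        return 0.0
    total = sum(
        (a_j - b_j) ** 2
        for a, b in product(cluster, repeat=2)  # every ordered pair (i, i')
        for a_j, b_j in zip(a, b)               # sum over the p features
    )
    return total / len(cluster)


def kmeans_objective(clusters):
    """Sum of W(C_k) over all K clusters."""
    return sum(within_cluster_variation(c) for c in clusters)


# Hypothetical toy data: four points with p = 2 features.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Two candidate partitions into K = 2 clusters.
good = [[points[0], points[1]], [points[2], points[3]]]  # nearby points together
bad = [[points[0], points[2]], [points[1], points[3]]]   # distant points together

print(kmeans_objective(good))  # 8.0
print(kmeans_objective(bad))   # 200.0
```

The partition that groups nearby points yields the smaller objective value, which is the behavior the minimization rewards. Note that the double index $i, i^{\prime}$ is read here as running over all ordered pairs, so each pair of distinct points is counted twice.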
