Clustering with H2O

Introduction to clustering models

Unsupervised clustering models group similar data points based on their inherent patterns or similarities. Their goal is to find natural groupings within the data without any predefined labels or categories.

To achieve well-defined clusters, the algorithm measures the similarity or distance between data points and groups together those that are close. The goal of unsupervised clustering is to maximize intracluster similarity while also maximizing intercluster dissimilarity. In other words, we want the data points within a cluster to be similar to each other and, at the same time, we want each cluster to be distinct from the others.

By optimizing these two factors, unsupervised clustering algorithms can identify meaningful and well-separated clusters in the data. They allow us to uncover underlying patterns, segment the data into distinct groups, and gain insights into the structure of the dataset.

Approaches to computing similarity

The underlying mathematics in clustering involves measuring the similarity or dissimilarity between data points. The most common approach is to use distance measures to quantify the similarity between data points. Euclidean distance measures how close two points or vectors are in a multidimensional space based on the straight-line distance between them; the smaller the Euclidean distance, the more similar the points, indicating they are closer together in space. Cosine similarity measures how similar two vectors or documents are based on the cosine of the angle between them; when the vectors point in the same direction, the cosine similarity is closer to 1, indicating high similarity, and vice versa. These measures calculate the geometric or angular difference between the feature vectors representing the data points.
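
As a quick illustration, here is a minimal sketch of both measures using NumPy. The library choice and the toy vectors are ours, not part of the lesson:

```python
import numpy as np

# Two feature vectors representing data points
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: straight-line distance between the points
euclidean = np.linalg.norm(a - b)

# Cosine similarity: cosine of the angle between the vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean)  # ~3.742: the points sit some distance apart
print(cosine)     # 1.0: b is a scaled copy of a, so the angle is zero
```

Note how the two measures can disagree: the points are geometrically apart (nonzero Euclidean distance), yet perfectly similar in direction (cosine similarity of 1).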

Once the similarity/dissimilarity is computed, clustering algorithms minimize the intracluster distance (the distance between data points within the same cluster) and maximize the intercluster distance (the distance between data points from different clusters).

Various algorithms, such as K-Means, hierarchical clustering, or density-based spatial clustering of applications with noise (DBSCAN), employ different mathematical techniques to assign data points to clusters iteratively.

K-Means

  • K-Means partitions the data into K clusters, where each cluster is represented by its centroid.

  • The algorithm starts by randomly initializing K centroids.

  • It iteratively assigns each data point to the nearest centroid based on the Euclidean distance.

  • After all data points are assigned, the centroids are updated by calculating the mean of the assigned points.

  • This process continues until convergence, when the centroids no longer change significantly or a maximum number of iterations is reached (see the sketch after this list).
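
To make these steps concrete, here is a minimal from-scratch sketch in Python with NumPy; the function name, tolerance, and toy data are our own illustrative choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize k centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid
        # (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned
        # points; an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop when the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```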

Hierarchical clustering

  • Hierarchical clustering builds a treelike structure (dendrogram) by iteratively merging or splitting clusters.

  • It starts by considering each data point as a separate cluster.

  • The algorithm then merges the two closest clusters based on a chosen distance metric, such as Euclidean distance or Manhattan distance (the distance between two points in a grid-based system that allows only horizontal and vertical movements, no diagonals).

  • This process continues until all data points are in a single cluster or a specific number of clusters is reached.

  • The dendrogram can be cut at different levels to obtain different numbers of clusters, as shown in the sketch below.
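
The following sketch uses SciPy's hierarchical clustering utilities (an assumed library choice; the lesson itself does not prescribe one) to build a dendrogram and cut it at two different levels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Agglomerative clustering: each point starts as its own cluster, and the
# two closest clusters are merged at every step
Z = linkage(X, method="average", metric="euclidean")

# "Cut" the dendrogram at different levels to get different cluster counts
labels_2 = fcluster(Z, t=2, criterion="maxclust")  # two clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # four clusters
```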

DBSCAN

  • DBSCAN clusters data based on density connectivity rather than a predefined number of clusters.

  • It defines clusters as areas of high density separated by areas of low density.

  • The algorithm starts with a randomly selected data point and forms a cluster around it by including neighboring points within a specified distance (epsilon).

  • It then expands the cluster by recursively adding new points that are also within the epsilon distance.

  • Points that are not within the epsilon distance of any existing cluster are labeled as noise or outliers (see the sketch below).
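
The sketch below runs DBSCAN via scikit-learn (an assumed library choice) on toy data with a few scattered outliers; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a handful of scattered outliers
X = np.vstack([
    rng.normal(0, 0.3, (40, 2)),
    rng.normal(4, 0.3, (40, 2)),
    rng.uniform(-2, 6, (5, 2)),
])

# eps is the neighborhood radius (epsilon); min_samples is the minimum
# number of neighbors a point needs to seed a dense region
model = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Cluster labels per point; points labeled -1 were flagged as noise/outliers
print(np.unique(model.labels_))
```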

H2OKMeansEstimator

K-Means is an iterative, centroid-based clustering algorithm. It begins by randomly selecting initial centroids and then iteratively updates them, converging toward a local minimum of the within-cluster sum of squares. The number of clusters to generate is flexible and serves as a tuning parameter for the analysis. The output of K-Means is a concise matrix that assigns data points to specific clusters and provides the coordinates of the cluster centers for the chosen attributes.
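
A minimal usage sketch of H2OKMeansEstimator follows; the file name clustering_data.csv is a hypothetical placeholder, and k=3 is an arbitrary choice:

```python
import h2o
from h2o.estimators import H2OKMeansEstimator

h2o.init()

# Hypothetical input file; substitute your own dataset
data = h2o.import_file("clustering_data.csv")

# k is the number of clusters to generate, a tuning parameter
kmeans = H2OKMeansEstimator(k=3, standardize=True, seed=42)
kmeans.train(x=data.columns, training_frame=data)

# Coordinates of the cluster centers for the chosen attributes
print(kmeans.centers())

# Assign each row of the data to a cluster
assignments = kmeans.predict(data)
```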

Estimating k in K-Means

To estimate the number of clusters, k, the H2OKMeansEstimator goes through the following steps:

  1. Initiate with a single cluster and execute the K-Means algorithm to calculate the centroid.

  2. Identify the variable with the highest range in the training dataset and partition the data at that variable's mean.

  3. Apply the K-Means algorithm to the resulting two clusters.

  4. Identify the variable and cluster with the maximum range, then split that cluster based on the mean of the variable.

  5. Rerun the K-Means algorithm and repeat the process.

  6. Continue the iterative application of K-Means until the specified stopping criteria are met (the snippet after this list shows how to enable this estimation).
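
In the Python API, this behavior is enabled by setting estimate_k=True, with k acting as an upper bound. The sketch below reuses the same hypothetical data file as the earlier example:

```python
import h2o
from h2o.estimators import H2OKMeansEstimator

h2o.init()
data = h2o.import_file("clustering_data.csv")  # hypothetical file, as before

# With estimate_k=True, H2O starts from a single cluster and keeps splitting
# (as described in the steps above) until the stopping criteria are met or
# k clusters are reached
kmeans_auto = H2OKMeansEstimator(k=10, estimate_k=True, standardize=True, seed=42)
kmeans_auto.train(x=data.columns, training_frame=data)

# The number of clusters H2O settled on
print(len(kmeans_auto.centers()))
```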

H2O uses proportional reduction in error (PRE) to determine when to stop splitting. The PRE value is calculated based on the sum of squares within (SSW):

PRE = (SSW[before split] − SSW[after split]) / SSW[before split]

Splitting stops when the PRE for a candidate split falls below a threshold.
