Spectral clustering algorithm implemented from scratch

Spectral clustering is a powerful technique for clustering data based on the spectral properties of a similarity matrix built from the data. It uses the eigenvalues and eigenvectors of the graph Laplacian derived from this similarity matrix to partition the data into meaningful clusters. Unlike traditional clustering algorithms, spectral clustering is particularly effective at capturing complex structures and handling nonconvex shapes. Here, we will implement spectral clustering from scratch using only the NumPy library in Python.

Algorithm

Spectral clustering involves the following steps:

  • Use the Gaussian similarity measure to calculate pairwise similarities and form the affinity matrix A, where A_{ij} = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right).

  • Calculate the degree matrix D, a diagonal matrix with D_{ii} = \sum_j A_{ij}, and the un-normalized Laplacian matrix L = D - A (a small worked example follows this list).

  • Perform eigenvalue decomposition on the Laplacian matrix L and select the k smallest eigenvalues and their corresponding eigenvectors.

  • Apply k-means clustering on the eigenvectors to assign data points to clusters based on similarities.
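
As a small worked example with made-up similarity values, suppose three points yield the affinity matrix A below. The row sums give the diagonal degree matrix D, and subtracting A gives the un-normalized Laplacian L:

A = \begin{pmatrix} 1 & 0.5 & 0.1 \\ 0.5 & 1 & 0.2 \\ 0.1 & 0.2 & 1 \end{pmatrix}, \qquad D = \begin{pmatrix} 1.6 & 0 & 0 \\ 0 & 1.7 & 0 \\ 0 & 0 & 1.3 \end{pmatrix}, \qquad L = D - A = \begin{pmatrix} 0.6 & -0.5 & -0.1 \\ -0.5 & 0.7 & -0.2 \\ -0.1 & -0.2 & 0.3 \end{pmatrix}

Every row of L sums to zero, which is why the constant vector is always an eigenvector of L with eigenvalue 0.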

Let's implement this algorithm as follows:

Generate synthetic data

Let’s generate a synthetic dataset using the make_blobs function from scikit-learn and visualize the initial dataset.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Step 0: Generate synthetic data
data, true_labels = make_blobs(n_samples=200, centers=3, random_state=42)
# Visualize the initial dataset
plt.scatter(data[:, 0], data[:, 1])
plt.show()

Create the affinity matrix

Next, we construct the affinity matrix, which captures pairwise similarities between data points and forms the basis for spectral clustering. We employ a Gaussian similarity measure, where the value of sigma controls the width of the Gaussian kernel. The gaussian_similarity function computes the Gaussian similarity between two data points, while the create_affinity_matrix function uses this measure to build the full affinity matrix for the dataset. The resulting matrix reflects the strength of the connections between data points, which drives the subsequent steps of spectral clustering.

# Step 1: Create the affinity matrix
def gaussian_similarity(x_i, x_j, sigma=1.0):
    return np.exp(-np.sum((x_i - x_j)**2) / (2 * sigma**2))

def create_affinity_matrix(data, sigma=1.0):
    n_samples = data.shape[0]
    affinity_matrix = np.zeros((n_samples, n_samples))
    # Calculate pairwise similarities using the Gaussian similarity measure
    for i in range(n_samples):
        for j in range(n_samples):
            affinity_matrix[i, j] = gaussian_similarity(data[i], data[j], sigma)
    return affinity_matrix

# Set the value of sigma for the Gaussian similarity measure
sigma_value = 1.0
# Create the affinity matrix
affinity_matrix = create_affinity_matrix(data, sigma=sigma_value)
# Visualize the affinity matrix
plt.imshow(affinity_matrix, cmap='viridis')
plt.colorbar()
plt.show()
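
The double loop above is easy to read but slow for large datasets because of Python loop overhead. As a minimal sketch, reusing the data and sigma_value variables defined above, the same matrix can be computed with NumPy broadcasting:

# Vectorized equivalent of create_affinity_matrix
# diff[i, j] = x_i - x_j, computed via broadcasting; shape (n, n, d)
diff = data[:, np.newaxis, :] - data[np.newaxis, :, :]
sq_dists = np.sum(diff**2, axis=-1)  # squared pairwise distances, shape (n, n)
affinity_vectorized = np.exp(-sq_dists / (2 * sigma_value**2))
# Sanity check: matches the loop-based result
assert np.allclose(affinity_vectorized, affinity_matrix)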

Compute the Laplacian matrix

The Laplacian matrix encapsulates the local structure of the dataset and aids in the identification of clusters. The code snippet below computes the degree matrix from the row sums of the affinity matrix and then subtracts the affinity matrix from it to obtain the un-normalized Laplacian, as outlined in the algorithm above.

# Step 2: Compute the Laplacian matrix
def compute_laplacian(affinity_matrix):
    # Compute the degree matrix
    degree_matrix = np.diag(np.sum(affinity_matrix, axis=1))
    # Compute the un-normalized Laplacian
    laplacian_matrix = degree_matrix - affinity_matrix
    return laplacian_matrix, degree_matrix

# Compute the Laplacian matrix and degree matrix
laplacian_matrix, degree_matrix = compute_laplacian(affinity_matrix)
# Visualize the Laplacian matrix
plt.imshow(laplacian_matrix, cmap='viridis')
plt.colorbar()
plt.show()
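
Two properties are worth verifying here: every row of the un-normalized Laplacian sums to zero (the diagonal degree exactly cancels the off-diagonal similarities), and the matrix is symmetric because the affinity matrix is. A quick check on the matrices we just computed:

# Row sums are zero since D_ii = sum_j A_ij
print(np.allclose(laplacian_matrix.sum(axis=1), 0))       # True
# L inherits symmetry from the affinity matrix
print(np.allclose(laplacian_matrix, laplacian_matrix.T))  # True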

Eigenvalue decomposition

This process involves analyzing the Laplacian matrix we computed earlier. Eigenvalue decomposition unveils essential characteristics of our dataset that aid in the identification of clusters.

The code below takes the Laplacian matrix and breaks it down into eigenvalues and their corresponding eigenvectors. These eigenvalues are then sorted, allowing us to examine them more effectively. The resulting plot of sorted eigenvalues provides valuable insights into the inherent structure of our data.

# Step 3: Eigenvalue decomposition
# np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending
# order; we sort explicitly anyway to make the ordering assumption clear
eigenvalues, eigenvectors = np.linalg.eigh(laplacian_matrix)
sorted_indices = np.argsort(eigenvalues)
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]
# Visualize sorted eigenvalues
plt.plot(eigenvalues, marker='o', linestyle='-', color='b')
plt.xlabel('Eigenvalue Index')
plt.ylabel('Eigenvalue Magnitude')
plt.show()
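
A common way to read this plot is the eigengap heuristic: the number of eigenvalues close to zero, before the first large jump, suggests a natural number of clusters. A rough sketch of the heuristic over the smallest eigenvalues (the window of 10 is an arbitrary choice):

# Eigengap heuristic: find the largest jump among the smallest eigenvalues
gaps = np.diff(eigenvalues[:10])
suggested_k = int(np.argmax(gaps)) + 1
print(f"Suggested number of clusters: {suggested_k}")  # Typically 3 for well-separated blobs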

K-means clustering on eigenvectors

With the eigenvalue decomposition laying the groundwork, we proceed to the final step: clustering. We select the eigenvectors corresponding to the k smallest eigenvalues to form a new matrix, and by applying a standard clustering technique to the rows of this matrix, we partition the dataset into distinct clusters.

# Step 4: Spectral clustering
from sklearn.cluster import KMeans

num_clusters = 3  # Adjust the number of clusters as needed
# Select the eigenvectors corresponding to the smallest eigenvalues
selected_eigenvectors = eigenvectors[:, :num_clusters]
# Apply k-means clustering on the selected eigenvectors
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(selected_eigenvectors)
# Visualize the spectral clustering result
plt.scatter(data[:, 0], data[:, 1], c=cluster_labels, cmap='viridis', edgecolors='k', marker='o', s=100)
plt.show()
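
Since make_blobs also returned ground-truth labels, we can sanity-check the clustering. The adjusted Rand index from scikit-learn compares two labelings while ignoring the arbitrary numbering of the clusters; a score of 1.0 indicates a perfect match:

from sklearn.metrics import adjusted_rand_score
# Compare the spectral clustering labels against the ground truth
ari = adjusted_rand_score(true_labels, cluster_labels)
print(f"Adjusted Rand index: {ari:.3f}")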

Conclusion

In spectral clustering, we use the information encoded in the eigenvectors corresponding to the smallest eigenvalues. The key distinction from plain k-means is that we do not cluster the original data points directly; instead, we follow a two-step process: obtaining informative eigenvectors through eigenvalue decomposition and then applying k-means to these transformed points. This embedding allows spectral clustering to capture and utilize more intricate patterns within the data.
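
To see the nonconvex case mentioned in the introduction, here is a minimal sketch that reuses the functions above on scikit-learn's make_moons dataset. The smaller kernel width (sigma=0.1 here) is an assumption chosen so that similarity stays local; with a wide kernel, the two moons blur together:

from sklearn.datasets import make_moons

# Two interleaving half-circles: a classic nonconvex clustering example
moons, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
A_moons = create_affinity_matrix(moons, sigma=0.1)  # small sigma keeps similarity local
L_moons, _ = compute_laplacian(A_moons)
vals, vecs = np.linalg.eigh(L_moons)
# Cluster the rows of the two eigenvectors with the smallest eigenvalues
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(vecs[:, :2])
plt.scatter(moons[:, 0], moons[:, 1], c=moon_labels, cmap='viridis')
plt.show()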
