Spectral clustering is a powerful technique that partitions data using the spectral properties of a similarity matrix: it uses the eigenvalues and eigenvectors of that matrix to split the data into meaningful clusters. Unlike traditional clustering algorithms, spectral clustering is particularly effective at capturing complex structures and handling nonconvex cluster shapes. Here, we will implement spectral clustering from scratch in Python using NumPy, with scikit-learn used only to generate data and to run the final k-means step.
Spectral clustering involves the following steps:
Use the Gaussian similarity measure to calculate pairwise similarities and form the affinity matrix
Calculate the degree matrix and form the graph Laplacian
Perform eigenvalue decomposition on the Laplacian matrix
Apply k-means clustering on the selected eigenvectors to assign data points to clusters
Let's implement this algorithm as follows:
Let's generate a synthetic dataset using the make_blobs function from scikit-learn and visualize the initial dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Step 0: Generate synthetic data
data, true_labels = make_blobs(n_samples=200, centers=3, random_state=42)

# Visualize the initial dataset
plt.scatter(data[:, 0], data[:, 1])
plt.show()
Next, we construct the affinity matrix, a fundamental component of spectral clustering. The affinity matrix captures pairwise similarities between data points and forms the basis for everything that follows. We employ a Gaussian similarity measure, exp(-||x_i - x_j||^2 / (2 * sigma^2)), where the value of sigma controls the width of the Gaussian kernel. The gaussian_similarity function computes the Gaussian similarity between two data points, while the create_affinity_matrix function uses this measure to build the full affinity matrix for the dataset. The resulting matrix reflects the strength of the connections between data points, which is essential for the subsequent steps.
# Step 1: Create the affinity matrix
def gaussian_similarity(x_i, x_j, sigma=1.0):
    return np.exp(-np.sum((x_i - x_j)**2) / (2 * sigma**2))

def create_affinity_matrix(data, sigma=1.0):
    n_samples = data.shape[0]
    affinity_matrix = np.zeros((n_samples, n_samples))
    # Calculate pairwise similarities using the Gaussian similarity measure
    for i in range(n_samples):
        for j in range(n_samples):
            affinity_matrix[i, j] = gaussian_similarity(data[i], data[j], sigma)
    return affinity_matrix

# Set the value of sigma for the Gaussian similarity measure
sigma_value = 1.0

# Create the affinity matrix
affinity_matrix = create_affinity_matrix(data, sigma=sigma_value)

# Visualize the affinity matrix
plt.imshow(affinity_matrix, cmap='viridis')
plt.colorbar()
plt.show()
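The nested Python loops are easy to read but cost O(n^2) interpreted iterations. As a side note, the same matrix can be built with NumPy broadcasting; the following is a minimal sketch of that alternative (the function name create_affinity_matrix_vectorized is ours, not part of the tutorial's code):

# A vectorized alternative (sketch): compute all pairwise squared distances
# at once via broadcasting, then apply the Gaussian kernel elementwise.
def create_affinity_matrix_vectorized(data, sigma=1.0):
    diffs = data[:, None, :] - data[None, :, :]   # shape (n, n, d)
    sq_dists = np.sum(diffs**2, axis=-1)          # shape (n, n)
    return np.exp(-sq_dists / (2 * sigma**2))

Since both versions compute the same quantity, np.allclose(affinity_matrix, create_affinity_matrix_vectorized(data)) should hold.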
The Laplacian matrix encapsulates the local structure of the dataset and aids in the identification of clusters. The code snippet below computes the unnormalized Laplacian L = D - W, where D is the degree matrix (a diagonal matrix whose entries are the row sums of the affinity matrix) and W is the affinity matrix itself.
# Step 2: Compute the Laplacian matrix
def compute_laplacian(affinity_matrix):
    # Compute the degree matrix: D[i, i] is the sum of row i of the affinity matrix
    degree_matrix = np.diag(np.sum(affinity_matrix, axis=1))
    # Compute the unnormalized Laplacian: L = D - W
    laplacian_matrix = degree_matrix - affinity_matrix
    return laplacian_matrix, degree_matrix

# Compute the Laplacian matrix and degree matrix
laplacian_matrix, degree_matrix = compute_laplacian(affinity_matrix)

# Visualize the Laplacian matrix
plt.imshow(laplacian_matrix, cmap='viridis')
plt.colorbar()
plt.show()
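Two properties follow directly from the construction: L is symmetric, and every row sums to zero because each diagonal entry of D equals the sum of the corresponding row of W. A quick sanity check (our addition, not part of the original walkthrough):

# Sanity checks on the unnormalized Laplacian L = D - W
assert np.allclose(laplacian_matrix, laplacian_matrix.T)      # symmetric
assert np.allclose(laplacian_matrix.sum(axis=1), 0.0)         # rows sum to zero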
This step analyzes the Laplacian matrix we computed earlier. Eigenvalue decomposition reveals essential characteristics of the dataset that aid in identifying clusters.
The code below decomposes the Laplacian matrix into eigenvalues and their corresponding eigenvectors, then sorts the eigenvalues in ascending order so we can examine them more easily. The resulting plot of sorted eigenvalues offers insight into the structure of the data: for a well-separated dataset, the smallest eigenvalues sit near zero, and a visible jump in the sequence hints at the number of clusters.
# Step 3: Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eigh(laplacian_matrix)

# np.linalg.eigh already returns eigenvalues in ascending order;
# the explicit sort below is kept for clarity
sorted_indices = np.argsort(eigenvalues)
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Visualize sorted eigenvalues
plt.plot(eigenvalues, marker='o', linestyle='-', color='b')
plt.xlabel('Eigenvalue Index')
plt.ylabel('Eigenvalue Magnitude')
plt.show()
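The jump mentioned above can also be read off programmatically via the well-known eigengap heuristic: choose the number of clusters where the gap between consecutive sorted eigenvalues is largest. A minimal sketch (our addition; the names gaps and suggested_k are illustrative):

# Eigengap heuristic (sketch): a large jump between consecutive small
# eigenvalues suggests a natural number of clusters
gaps = np.diff(eigenvalues[:10])   # inspect only the first few eigenvalues
suggested_k = int(np.argmax(gaps)) + 1
print(f"Suggested number of clusters: {suggested_k}")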
With the eigenvalue decomposition laying the groundwork, we proceed to the final step: spectral clustering. Here we select the eigenvectors corresponding to the smallest eigenvalues to form a new, low-dimensional representation of the data. Applying a standard clustering technique such as k-means to this representation partitions the dataset into distinct clusters.
# Step 4: Spectral clustering
from sklearn.cluster import KMeans

num_clusters = 3  # Adjust the number of clusters as needed

# Select informative eigenvectors: the columns corresponding to the
# smallest eigenvalues (already sorted in Step 3)
selected_eigenvectors = eigenvectors[:, :num_clusters]

# Use k-means clustering on the selected eigenvectors
# (n_init is set explicitly to avoid a warning in recent scikit-learn versions)
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(selected_eigenvectors)

# Visualize the spectral clustering result
plt.scatter(data[:, 0], data[:, 1], c=cluster_labels, cmap='viridis',
            edgecolors='k', marker='o', s=100)
plt.show()
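Since make_blobs also returned the generating labels (true_labels, unused so far), we can quantify how well the recovered clusters match them. One option is scikit-learn's adjusted Rand index; a short sketch of that check:

from sklearn.metrics import adjusted_rand_score

# Compare recovered clusters to the labels that generated the blobs;
# a score of 1.0 means a perfect match up to relabeling
print(adjusted_rand_score(true_labels, cluster_labels))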
In spectral clustering, we use the information encoded in the eigenvectors corresponding to the smallest sorted eigenvalues. The key distinction from plain k-means is that we do not cluster the original data points directly; instead, we follow a two-step process: obtain informative eigenvectors through eigenvalue decomposition, then apply k-means to these transformed points. This is what allows spectral clustering to capture and exploit more intricate patterns within the data.
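To see that advantage concretely, one can rerun the same pipeline on a nonconvex dataset such as scikit-learn's make_moons, where k-means on the raw coordinates typically splits each arc in half. The sketch below reuses the functions defined above; the choice sigma=0.1 is illustrative, not tuned:

from sklearn.datasets import make_moons

# Sketch: the full pipeline on two interleaved half-moons
moons, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
W = create_affinity_matrix(moons, sigma=0.1)
L, _ = compute_laplacian(W)
vals, vecs = np.linalg.eigh(L)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(vecs[:, :2])

plt.scatter(moons[:, 0], moons[:, 1], c=moon_labels, cmap='viridis')
plt.show()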