Spectral clustering is a powerful technique that partitions data using the spectral properties of a similarity matrix: it uses the eigenvalues and eigenvectors of that matrix to split the data into meaningful clusters. Unlike traditional clustering algorithms, spectral clustering is particularly effective at capturing complex structures and handling nonconvex cluster shapes. Here, we will implement spectral clustering from scratch in Python using NumPy, with scikit-learn used only to generate data and to run the final k-means step.
Spectral clustering involves the following steps:
Use the Gaussian similarity measure to calculate pairwise similarities and form the affinity matrix
Calculate the degree matrix and form the graph Laplacian
Perform eigenvalue decomposition on the Laplacian matrix
Apply k-means clustering on the selected eigenvectors to assign data points to clusters
Let's implement this algorithm as follows:
Let's generate a synthetic dataset using the make_blobs function from scikit-learn and visualize the initial dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Step 0: Generate synthetic data
data, true_labels = make_blobs(n_samples=200, centers=3, random_state=42)

# Visualize the initial dataset
plt.scatter(data[:, 0], data[:, 1])
plt.show()
Next, we construct the affinity matrix, a fundamental component of spectral clustering. The affinity matrix captures pairwise similarities between data points and forms the basis for everything that follows. We employ a Gaussian similarity measure, exp(-||x_i - x_j||^2 / (2 * sigma^2)), where the value of sigma controls the width of the Gaussian kernel. The gaussian_similarity function computes the Gaussian similarity between two data points, while the create_affinity_matrix function uses this measure to build the full affinity matrix for the dataset. The resulting matrix reflects the strength of the connections between data points, which is essential for the subsequent steps.
# Step 1: Create the affinity matrix
def gaussian_similarity(x_i, x_j, sigma=1.0):
    return np.exp(-np.sum((x_i - x_j)**2) / (2 * sigma**2))

def create_affinity_matrix(data, sigma=1.0):
    n_samples = data.shape[0]
    affinity_matrix = np.zeros((n_samples, n_samples))
    # Calculate pairwise similarities using the Gaussian similarity measure
    for i in range(n_samples):
        for j in range(n_samples):
            affinity_matrix[i, j] = gaussian_similarity(data[i], data[j], sigma)
    return affinity_matrix

# Set the value of sigma for the Gaussian similarity measure
sigma_value = 1.0

# Create the affinity matrix
affinity_matrix = create_affinity_matrix(data, sigma=sigma_value)

# Visualize the affinity matrix
plt.imshow(affinity_matrix, cmap='viridis')
plt.colorbar()
plt.show()
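The nested Python loops are easy to read but cost O(n^2) interpreted iterations. As a side note, the same matrix can be built with NumPy broadcasting; the following is a minimal sketch of that alternative (the function name create_affinity_matrix_vectorized is ours, not part of the tutorial's code):

# A vectorized alternative (sketch): compute all pairwise squared distances
# at once via broadcasting, then apply the Gaussian kernel elementwise.
def create_affinity_matrix_vectorized(data, sigma=1.0):
    diffs = data[:, None, :] - data[None, :, :]   # shape (n, n, d)
    sq_dists = np.sum(diffs**2, axis=-1)          # shape (n, n)
    return np.exp(-sq_dists / (2 * sigma**2))

Since both versions compute the same quantity, np.allclose(affinity_matrix, create_affinity_matrix_vectorized(data)) should hold.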
The Laplacian matrix encapsulates the local structure of the dataset and aids in the identification of clusters. The code snippet below computes the unnormalized Laplacian L = D - W, where D is the degree matrix (a diagonal matrix whose entries are the row sums of the affinity matrix) and W is the affinity matrix itself.
# Step 2: Compute the Laplacian matrix
def compute_laplacian(affinity_matrix):
    # Compute the degree matrix: D[i, i] is the sum of row i of the affinity matrix
    degree_matrix = np.diag(np.sum(affinity_matrix, axis=1))
    # Compute the unnormalized Laplacian: L = D - W
    laplacian_matrix = degree_matrix - affinity_matrix
    return laplacian_matrix, degree_matrix

# Compute the Laplacian matrix and degree matrix
laplacian_matrix, degree_matrix = compute_laplacian(affinity_matrix)

# Visualize the Laplacian matrix
plt.imshow(laplacian_matrix, cmap='viridis')
plt.colorbar()
plt.show()
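Two properties follow directly from the construction: L is symmetric, and every row sums to zero because each diagonal entry of D equals the sum of the corresponding row of W. A quick sanity check (our addition, not part of the original walkthrough):

# Sanity checks on the unnormalized Laplacian L = D - W
assert np.allclose(laplacian_matrix, laplacian_matrix.T)      # symmetric
assert np.allclose(laplacian_matrix.sum(axis=1), 0.0)         # rows sum to zero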
This step analyzes the Laplacian matrix we computed earlier. Eigenvalue decomposition reveals essential characteristics of the dataset that aid in identifying clusters.
The code below decomposes the Laplacian matrix into eigenvalues and their corresponding eigenvectors, then sorts the eigenvalues in ascending order so we can examine them more easily. The resulting plot of sorted eigenvalues offers insight into the structure of the data: for a well-separated dataset, the smallest eigenvalues sit near zero, and a visible jump in the sequence hints at the number of clusters.
# Step 3: Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eigh(laplacian_matrix)

# np.linalg.eigh already returns eigenvalues in ascending order;
# the explicit sort below is kept for clarity
sorted_indices = np.argsort(eigenvalues)
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

# Visualize sorted eigenvalues
plt.plot(eigenvalues, marker='o', linestyle='-', color='b')
plt.xlabel('Eigenvalue Index')
plt.ylabel('Eigenvalue Magnitude')
plt.show()
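The jump mentioned above can also be read off programmatically via the well-known eigengap heuristic: choose the number of clusters where the gap between consecutive sorted eigenvalues is largest. A minimal sketch (our addition; the names gaps and suggested_k are illustrative):

# Eigengap heuristic (sketch): a large jump between consecutive small
# eigenvalues suggests a natural number of clusters
gaps = np.diff(eigenvalues[:10])   # inspect only the first few eigenvalues
suggested_k = int(np.argmax(gaps)) + 1
print(f"Suggested number of clusters: {suggested_k}")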
With the eigenvalue decomposition laying the groundwork, we proceed to the final step: spectral clustering. Here we select the eigenvectors corresponding to the smallest eigenvalues to form a new, low-dimensional representation of the data. Applying a standard clustering technique such as k-means to this representation partitions the dataset into distinct clusters.
# Step 4: Spectral clustering
from sklearn.cluster import KMeans

num_clusters = 3  # Adjust the number of clusters as needed

# Select informative eigenvectors: the columns corresponding to the
# smallest eigenvalues (already sorted in Step 3)
selected_eigenvectors = eigenvectors[:, :num_clusters]

# Use k-means clustering on the selected eigenvectors
# (n_init is set explicitly to avoid a warning in recent scikit-learn versions)
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(selected_eigenvectors)

# Visualize the spectral clustering result
plt.scatter(data[:, 0], data[:, 1], c=cluster_labels, cmap='viridis',
            edgecolors='k', marker='o', s=100)
plt.show()
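Since make_blobs also returned the generating labels (true_labels, unused so far), we can quantify how well the recovered clusters match them. One option is scikit-learn's adjusted Rand index; a short sketch of that check:

from sklearn.metrics import adjusted_rand_score

# Compare recovered clusters to the labels that generated the blobs;
# a score of 1.0 means a perfect match up to relabeling
print(adjusted_rand_score(true_labels, cluster_labels))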
In spectral clustering, we use the information encoded in the eigenvectors corresponding to the smallest sorted eigenvalues. The key distinction from plain k-means is that we do not cluster the original data points directly; instead, we follow a two-step process: obtain informative eigenvectors through eigenvalue decomposition, then apply k-means to these transformed points. This is what allows spectral clustering to capture and exploit more intricate patterns within the data.
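To see that advantage concretely, one can rerun the same pipeline on a nonconvex dataset such as scikit-learn's make_moons, where k-means on the raw coordinates typically splits each arc in half. The sketch below reuses the functions defined above; the choice sigma=0.1 is illustrative, not tuned:

from sklearn.datasets import make_moons

# Sketch: the full pipeline on two interleaved half-moons
moons, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
W = create_affinity_matrix(moons, sigma=0.1)
L, _ = compute_laplacian(W)
vals, vecs = np.linalg.eigh(L)
moon_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(vecs[:, :2])

plt.scatter(moons[:, 0], moons[:, 1], c=moon_labels, cmap='viridis')
plt.show()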