K-Means Clustering
Explore interview prep questions centered around k-means clustering for unsupervised learning.
K-means clustering is a fundamental tool in unsupervised learning for grouping similar data points without prior labels. In this lesson, we'll practice implementing the algorithm step-by-step, understand how to form meaningful clusters, and evaluate clustering performance with silhouette scores. Let’s get started.
Implementing k-means clustering
You are given a dataset containing various data points representing customer transactions. Your task is to group these transactions into different clusters based on their similarities using the k-means clustering algorithm. The dataset is represented as a list of tuples, where each tuple contains transaction details like the amount spent and the number of items purchased.
Implement a function k_means_clustering(data, k) that clusters the given dataset into k clusters using the k-means algorithm. The function should return a list of clusters, where each cluster is a list of transaction points.
This question is frequently asked in ML engineer interviews involving recommendation systems or behavioral analysis.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
import random

def initialize_data():
    """
    Initialize data for K-means clustering.

    Returns:
    - data: A list of 2D points (tuples of x, y coordinates)
    - k: Number of clusters to create
    """
    # Set a random seed for reproducibility
    random.seed(42)

    # Generate a list of random 2D points
    # This example creates 100 points between 0 and 10
    data = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(100)]

    # Choose the number of clusters (k)
    k = 4

    return data, k

def k_means_clustering(data, k):
    # TODO - your implementation here

    # Return clusters
    return clusters

# Initializing the inputs
data, k = initialize_data()

# Running the clustering
output = k_means_clustering(data, k)
print(f"Output: {output}")
Sample answer
Here’s a plan that we can follow to implement our solution: ...
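As a point of comparison, here is a minimal sketch of one way the k_means_clustering stub could be filled in. It is not necessarily the lesson's own answer. It assumes the standard assign-then-update loop (Lloyd's algorithm), Euclidean distance, and random initialization from the data points. The max_iters and seed parameters are additions for illustration and are not part of the original signature.

```python
import math
import random


def k_means_clustering(data, k, max_iters=100, seed=0):
    """Group points into k clusters with the assign/update loop of k-means.

    max_iters and seed are illustrative extras, not part of the original stub.
    """
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    # Assumed initialization: use k distinct data points as starting centroids.
    centroids = rng.sample(data, k)

    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)

        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroid)

        # Stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return clusters
```

Called as k_means_clustering(data, k) with the output of initialize_data(), this returns four lists that together contain all 100 points. Because the starting centroids are random, a different seed can produce a different grouping, which is one reason a quality measure such as the silhouette score is useful for comparing runs.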