Visualize the Working of K-Nearest Neighbors

Learn to visualize the working principle behind k-nearest neighbors.

Let’s move on and put what we have learned so far into practice. As always, we need to import some basic libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)    # setting the font size for the whole notebook
sns.set_style('whitegrid') # setting the style -- just optional!

Let's generate a dataset with two classes and see how the KNN algorithm assigns a class to any new data point in practice.

The dataset

We can use make_biclusters() from scikit-learn to create a simple dataset with two features (columns) and 50 observations (data points). We can also add Gaussian noise while creating the clusters and assign each cluster a class. Let's do this.

Python 3.8
## Generate 2 random clusters, create dataframe
from sklearn.datasets import make_biclusters # to generate data
X, classes, cols = make_biclusters(shape=(50, 2),   # (n_rows, n_cols): 50 observations, 2 features
                                   n_clusters=2,    # number of classes we want
                                   noise=50,        # the standard deviation of the Gaussian noise
                                   random_state=101) # to re-generate the same data every time
# Creating the dataframe
df = pd.DataFrame(X, columns=['feature_2', 'feature_1'])
df['target'] = classes[0]
# Well, instead of True/False, let's replace with 1/0 targets -- a practice for map and lambda!
df['target'] = df['target'].map(lambda t: '1' if t == 0 else '0')
print(df.tail(2)) # tail this time!

Let's check the class distribution.

Python 3.8
print(df.target.value_counts())

As seen from the code output above, we have a dataset with two features and a target column.

Visualize training and the test data

Let's create a scatterplot and visualize the distribution of the data points. We can use the hue parameter to show the classes in different colors. In another plot (right side), we can add a test point whose class is unknown and which we want KNN to classify.

Python 3.8
# Create the figure with two subplots
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
# Figure 1 (left): the training data
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax1, s=150)
ax1.set_title("The data -- two classes")
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend().set_title('Target')
# Our new (unknown) point
test_point = [[10, 50]]
# Figure 2 (right): the training data plus the test point
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax2, s=150)
ax2.scatter(x=test_point[0][0], y=test_point[0][1], color="red", marker="*", s=1000)
ax2.set_title('Red star is a test (unknown) point')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend().set_title('Target')

The red star is a new, unknown data point whose class we want our KNN algorithm to predict, and for this purpose, we need to perform the following ...
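Before moving on, those steps can be sketched by hand with plain NumPy: compute the distance from the test point to every training point, pick the nearest neighbors, and take a majority vote. This is only a minimal illustration, not the final implementation; the choice of k = 5 and the use of Euclidean distance here are assumptions for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_biclusters

# Recreate the same dataset as above (same parameters and random_state)
X, classes, cols = make_biclusters(shape=(50, 2), n_clusters=2,
                                   noise=50, random_state=101)
df = pd.DataFrame(X, columns=['feature_2', 'feature_1'])
df['target'] = classes[0]
df['target'] = df['target'].map(lambda t: '1' if t == 0 else '0')

test_point = np.array([10, 50])  # (feature_1, feature_2) of the red star
k = 5                            # assumed number of neighbors for this sketch

# Step 1: Euclidean distance from the test point to every training point
dists = np.sqrt(((df[['feature_1', 'feature_2']].values - test_point) ** 2).sum(axis=1))

# Step 2: indices of the k nearest training points
nearest = np.argsort(dists)[:k]

# Step 3: majority vote among the neighbors' labels
predicted = df['target'].iloc[nearest].mode()[0]
print("Predicted class for the test point:", predicted)
```

The predicted class is simply the most common label among the five closest training points; this is exactly what scikit-learn's `KNeighborsClassifier` does internally (by default with uniform weights).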