K-Nearest Neighbors

K-Nearest Neighbor classifiers classify a given instance by looking at the training instances whose properties are closest to it. In this lesson, you will learn how the algorithm works.

K-Nearest Neighbors

Nearest Neighbor methods find a predefined number of training samples that are closest in distance to a new data point and predict the label for that point from the labels of those closest samples. The K-Nearest Neighbors classifier is the last classifier we will discuss in this chapter. It is a non-generalizing algorithm: instead of learning a function, it simply remembers the entire training dataset. For this reason, it is also called a lazy algorithm.

Steps involved in KNN

The K-Nearest Neighbors algorithm follows these steps to build the model and then classify a new instance.

  1. K-Nearest Neighbors begins by choosing a value of K and a distance or similarity metric.

  2. Next, it finds the K nearest neighbors (where K was chosen in the first step) of the given instance with respect to the chosen distance or similarity measure. To speed up this search, it can use indexing structures such as a KD-Tree or a Ball Tree.

  3. Then, it decides the class of the new instance by majority vote among the neighbors in the case of classification. In the case of regression, K-Nearest Neighbors takes the mean of the nearest neighbors' values to predict the outcome for the new instance.

  4. Choosing the optimal value of K falls under hyperparameter optimization, and the best value depends on the underlying data. Choosing a poor K can lead to underfitting or overfitting.
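Below is a minimal sketch of these steps using scikit-learn's KNeighborsClassifier. The toy feature values and labels are illustrative assumptions, not data from this lesson.

```python
from sklearn.neighbors import KNeighborsClassifier

# Step 1: choose K and a distance metric.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")

# Toy training data (illustrative only): each row is an instance.
X_train = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# "Training" just stores the data (KNN is a lazy, non-generalizing learner).
knn.fit(X_train, y_train)

# Steps 2-3: find the 3 nearest neighbors of each new instance and take a
# majority vote over their labels.
print(knn.predict([[2, 2]]))  # -> [0]
print(knn.predict([[7, 6]]))  # -> [1]
```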

Distance and Similarity Measures

Distance and Similarity Measures help us find the nearest neighbors by providing a single number metric.

Euclidean Distance

Euclidean distance measures the distance between two points in the plane and gives us a single number. The larger the number, the farther apart the points are (less similar); the smaller the number, the closer the points are (more similar). It is calculated as shown below and is used by default in scikit-learn's implementation of K-Nearest Neighbors.

dist((x,y),(a,b)) = \sqrt{(x-a)^2 + (y-b)^2}

As an example, the (Euclidean) distance between points (4, -2) and (-5, 7) can be found as:

dist((4,-2),(-5,7)) = \sqrt{(4-(-5))^{2}+(-2-7)^2} \approx 12.73
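As a quick sanity check, the snippet below recomputes the same distance with Python's standard library, using the two points from the example.

```python
import math

p, q = (4, -2), (-5, 7)

# math.dist (Python 3.8+) computes the Euclidean distance directly.
print(math.dist(p, q))                          # 12.7279...
print(math.sqrt((4 - (-5))**2 + (-2 - 7)**2))   # same value, written out
```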

Cosine Similarity

Cosine Similarity is a metric used to measure the similarity between items. It measures the cosine of the angle between two vectors in a multi-dimensional space; for the non-negative vectors used here, it holds a value between 0 and 1. The closer the result is to 1, the more similar the items are, and vice versa. Cosine Similarity is calculated as shown below.

sim(x,y)=\frac{x \cdot y}{|x||y|}

where x, y are the vectors and |x|, |y| are their Euclidean norms.

Suppose that x=(5,0,3,0,2,0,0,2,0,0) and y=(3,0,2,0,1,1,0,1,0,1).

x \cdot y = 5\times3+0\times0+3\times2+0\times0+2\times1+0\times1+0\times0+2\times1+0\times0+0\times1=25

|x| =\sqrt{5^2+0^2+3^2+0^2+2^2+0^2+0^2+2^2+0^2+0^2} \approx 6.48

|y| =\sqrt{3^2+0^2+2^2+0^2+1^2+1^2+0^2+1^2+0^2+1^2} \approx 4.12

sim(x,y)=\frac{25}{(6.48)(4.12)}

sim(x,y) \approx 0.94
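The same calculation can be reproduced in a few lines with NumPy. This sketch simply recomputes the dot product and norms for the example vectors above.

```python
import numpy as np

x = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
y = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

dot = np.dot(x, y)            # 25
norm_x = np.linalg.norm(x)    # ~6.48
norm_y = np.linalg.norm(y)    # ~4.12

print(dot / (norm_x * norm_y))  # ~0.94
```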

Nearest Neighbor algorithms
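scikit-learn exposes the indexing structures mentioned earlier (KD-Tree and Ball Tree) through its NearestNeighbors class. Below is a minimal sketch of querying such an index; the handful of 2-D points is an illustrative assumption, not data from this lesson.

```python
from sklearn.neighbors import NearestNeighbors

# Toy points (illustrative only).
X = [[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]]

# algorithm can be "kd_tree", "ball_tree", "brute", or "auto".
nn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree").fit(X)

distances, indices = nn.kneighbors([[2, 2]])
print(indices)    # indices of the 2 closest training points
print(distances)  # their Euclidean distances
```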
