K-Nearest Neighbors
Explore the K-Nearest Neighbors algorithm to understand how it classifies new data by finding closest training samples. Learn about distance metrics like Euclidean and cosine similarity, and efficient data structures like KDTree and BallTree that optimize neighbor searches. This lesson helps you grasp key concepts and apply KNN in machine learning tasks.
K-Nearest Neighbors
Nearest Neighbor Methods involve finding a predefined number of training samples that are closest in distance to the new data point. Additionally, they predict the label for the new data point based on the labels of the closest points. The K-nearest Neighbors classifier is the last classifier we will discuss in this chapter. It is a non-generalizing algorithm, as it remembers the whole training dataset instead of learning some function. It is also called the lazy algorithm due to its nature.
Steps involved in KNN
The K-Nearest Neighbors algorithm follows these steps to build the model and then classify a new instance:
-
K-Nearest Neighbor begins by choosing a value of K and a distance or similarity metric.
-
Next, it looks for the K (chosen in the first step) Nearest Neighbors to a given instance in regard to a distance or similarity measure. It looks for such instances in some indexing structures (like KD-Tree and Ball Tree) to optimize its search for neighbors.
-
Then, it decides the class of the new instance by majority vote in the case of classification. Note that in the case of regression, K-Nearest Neighbors takes the mean of the nearest neighbors to predict the outcome of the new instance.
-
Choosing the optimal value of K comes under the domain of hyperparameter optimization, and this value depends more on the underlying data. Using an inappropriate value of K can lead to underfitting or overfitting.
Distance and similarity measures
Distance and similarity measures help us find the nearest neighbors by providing a single-number metric.
Euclidean distance
This is used to find the distance between a set of points in the plane. It presents us with a single number. The greater the number, the more the points are at a far distance (less similar). The lesser the number, the closer the points are (more similar). This is calculated below, and it is embedded in scikit-learn’s implementation of K-Nearest Neighbors:
As an example, the (Euclidean) distance between points (4, -2) and (-5, 7) can be found as:
= ...