Working Principle of K-Nearest Neighbors

Learn about the working principle behind k-nearest neighbors.

We have explored the evaluation methods for linear regression (continuous target) and logistic regression (class prediction). In a classification problem, we're generally concerned with misclassifications: we want our algorithm to predict the correct class as often as possible on the test dataset so that it generalizes well. KNN is another straightforward and widely used algorithm, typically applied to classification, although it can be used for regression tasks as well.

KNN review and distance functions

As discussed in the previous lesson, KNN looks at the k nearest neighbors of a test point (where k is the selected number of neighbors), counts how many of them belong to each class, and assigns the class with the most votes to the test point. The algorithm stores all available data points and computes their distances from the test data for its classification task. It's also important to remember that the physical units of the features are critical here: all features have to be on a similar scale. Feature scaling (standardization or normalization) can drastically improve the model's performance.
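As a minimal sketch of this working principle, the snippet below uses scikit-learn's `KNeighborsClassifier` together with `StandardScaler` so that all features sit on a similar scale before distances are computed. The Iris dataset, the 75/25 split, and k = 5 are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset and hold out a test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Standardize the features, then classify each test point by a
# majority vote among its k = 5 nearest training points
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
```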

Distance functions for KNN

Typically, the KNN algorithm uses the Euclidean or Manhattan distance function; other distance metrics are used much less often. We can also define our own distance function for the algorithm (depending on our customers' needs), as shown in the sketch below. A few standard distance functions for KNN are described in the following subsections.
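The sketch below shows how a distance metric might be chosen with scikit-learn's `KNeighborsClassifier`: built-in metrics can be selected by name, and a custom metric can be passed as a callable. The `max_coordinate_distance` function is a hypothetical example introduced here only to illustrate the callable form.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Built-in metrics can be selected by name
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric="manhattan")

# A custom metric is any callable that takes two 1-D feature vectors
# and returns a non-negative distance (hypothetical example)
def max_coordinate_distance(a, b):
    return np.max(np.abs(a - b))

knn_custom = KNeighborsClassifier(n_neighbors=5, metric=max_coordinate_distance)
```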

Euclidean distance

The Euclidean distance is the straight-line distance between two points. It is the most common choice and can be calculated with the Pythagorean theorem from the Cartesian coordinates of the two points. The Euclidean distance is sometimes also called the Pythagorean distance.
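For reference, this is the standard formula: for two points p = (p_1, ..., p_n) and q = (q_1, ..., q_n) in an n-dimensional feature space (the symbols p, q, and n are introduced here just for this statement), the Euclidean distance is

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```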
