
KNN

Explore how to apply the KNN algorithm for classification and regression tasks using scikit-learn. Understand the concepts of nearest neighbors, distance metrics, and hyperparameter choices such as k and weighting. Learn how to preprocess data, scale features, and use KNeighborsClassifier and KNeighborsRegressor to make predictions. Gain insight into KNN's advantages, limitations, and practical considerations for performance and accuracy.

The k-nearest neighbors (KNN) algorithm makes predictions on new observations by looking at similar observations among the data it has seen before. One common application is imputing missing values: for a row with a missing entry, KNN looks at the feature values of the most similar rows in the dataset and uses them to estimate what the missing value should be. The k in KNN refers to the number of neighbors that the algorithm considers when making its estimation.

For example, suppose a particular row in our dataset has a missing value for one of its features. We can use KNN to estimate that value from the other available features. If we set k (the number of neighbors) to three, KNN identifies the three rows that are most similar to the row in question based on their feature values, then uses the values of the missing feature from those three rows to impute the missing value.
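The imputation scenario above can be sketched with scikit-learn's `KNNImputer`, using a small made-up dataset where one row is missing its first feature:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy dataset (illustrative values): the last row has a missing first feature
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [0.9, 1.9, 3.1],
    [8.0, 9.0, 7.0],
    [np.nan, 2.0, 3.0],  # row with the missing value
])

# Fill the gap using the 3 rows most similar on the non-missing features
imputer = KNNImputer(n_neighbors=3)
X_filled = imputer.fit_transform(X)

print(X_filled[-1])  # the NaN is replaced by the neighbors' mean -> [1. 2. 3.]
```

Here the three nearest rows have first-feature values 1.0, 1.1, and 0.9, so the imputed value is their mean, 1.0.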

[Figure: a three-step image sequence. A new observation appears, and we don't know which class it belongs to.]

The images above illustrate the step-by-step process used to classify observations based on their nearest neighbors:

  1. A new observation appears, and we don’t know its class.

  2. We identify the nearest neighbors.

  3. We use their majority class to infer a class for our new observation.
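The three steps above map directly onto scikit-learn's `KNeighborsClassifier`, sketched here with made-up 2-D training points:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training data (illustrative values) with two known classes
X_train = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# Vote among the 3 nearest neighbors of each new observation
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# The new observation's 3 nearest neighbors all belong to class 0,
# so the majority vote assigns it class 0
print(clf.predict([[2, 2]]))  # -> [0]
```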

The way KNN makes estimates depends on the type of task and on its hyperparameters. For regression tasks, KNN typically predicts the mean (or median) of the target values of the k nearest neighbors, while for classification tasks it usually takes a majority vote among the k nearest neighbors to determine the predicted class label. The choice of k and of the distance metric used to measure similarity between observations can greatly affect KNN's performance, so both should be chosen carefully.
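For the regression case, a minimal sketch with `KNeighborsRegressor` on made-up 1-D data shows the mean-of-neighbors behavior, along with the `weights` hyperparameter that controls how neighbors are combined:

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data (illustrative values)
X_train = [[1], [2], [3], [10], [11]]
y_train = [1.0, 2.0, 3.0, 10.0, 11.0]

# weights="uniform": the prediction is the plain mean of the k nearest targets
reg = KNeighborsRegressor(n_neighbors=3, weights="uniform")
reg.fit(X_train, y_train)
print(reg.predict([[2]]))  # mean of targets for x=1, 2, 3 -> [2.]

# weights="distance": closer neighbors get proportionally more influence,
# which can help when the nearest points are much more relevant than distant ones
reg_w = KNeighborsRegressor(n_neighbors=3, weights="distance")
reg_w.fit(X_train, y_train)
```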