You will probably have heard all the buzz about machine learning and its applications. And if you have, you’ve probably heard about the k-nearest neighbors (k-NN) algorithm and some of its common use cases:
Classification problems: The algorithm assigns a new data point the majority class of its nearest neighbors.
Recommender systems: The algorithm can suggest items by finding the users or items most similar to a given one.
Regression analysis: The algorithm can predict a numerical value by averaging the values of the nearest neighbors.
Healthcare and medicine: The algorithm can support tasks such as classifying patients or medical images based on similarity to past cases.
In this blog, our focus will be only on classification problems. We’ll take a look at a numerical example, running the k-NN algorithm step by step, and then implement it in Python from scratch.
Now, let’s explore the basics of the k-NN algorithm.
What is the k-NN algorithm?
Let’s take a look at the steps involved in classifying a new data point with k-NN.
Step 1: To classify a test instance, define the value of k, i.e., the number of nearest neighbors to consider.
How to define the value of k:
The value of k is usually an odd number to avoid a 50% split in a binary classification problem. A low value of k is more error-prone (or might not give accurate results). In practice, the value of k is usually set to around sqrt(n), where n is the number of instances in the dataset (a short code sketch of this rule follows the steps below).
Step 2: Calculate the similarity (often using Euclidean distance or another distance metric) between the new item and all data points in the training dataset. Based on the similarity values, extract the k nearest neighbors, i.e., the k most similar data points.
Step 3: Assuming multiple classes (e.g., A and B), count how many of the k nearest neighbors belong to each class.
Step 4: Assign the test instance the class that is most frequent among its k nearest neighbors (the majority class).
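As a quick illustration of the rule of thumb for choosing k from step 1, here is a minimal sketch (the value of n below is made up for illustration):

```python
import math

n = 1000                   # assumed number of training instances
k = round(math.sqrt(n))    # rule-of-thumb starting point: sqrt(n)
if k % 2 == 0:             # prefer an odd k to avoid ties in binary classification
    k += 1
print(k)                   # 33 for n = 1000
```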
Assume that we have the following dataset:
| Data point | Class |
| --- | --- |
| (2, 3) | A |
| (3, 4) | A |
| (5, 6) | B |
| (7, 8) | B |
| (1, 2) | A |
| (6, 7) | B |
| (4, 5) | A |
| (8, 9) | B |
| (2, 2) | A |
| (9, 9) | B |
We also have the following test data point: (6, 5)
Step 1: To classify the test instance (6, 5), define the value of k.
As mentioned above, the value of k is usually set to around sqrt(n), where n is the number of training instances. In our example dataset, n is 10 and sqrt(10) ≈ 3.16, so k is set to 3.
Step 2: We calculate the similarity between the new instance and all data points in the training dataset.
Note: We’ll be using the Euclidean distance formula in our calculations.
Recall the formula for Euclidean distance:
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Here, $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the two points. For example, the distance from (2, 3) to our test point (6, 5) is $\sqrt{(6 - 2)^2 + (5 - 3)^2} = \sqrt{20} \approx 4.47$.
After calculating distances, here is a table of distances from each point to our test point, i.e., (6, 5):
| Data point | Distance from (6, 5) |
| --- | --- |
| (2, 3) | 4.47 |
| (3, 4) | 3.16 |
| (5, 6) | 1.41 |
| (7, 8) | 3.16 |
| (1, 2) | 5.83 |
| (6, 7) | 2.00 |
| (4, 5) | 2.00 |
| (8, 9) | 4.47 |
| (2, 2) | 5.00 |
| (9, 9) | 5.00 |
Step 3: Count the number of data points belonging to each class among the k = 3 nearest neighbors. The table below ranks the three closest points:
| Data point | Class | Distance from (6, 5) | Rank |
| --- | --- | --- | --- |
| (2, 3) | A | 4.47 | |
| (3, 4) | A | 3.16 | |
| (5, 6) | B | 1.41 | 1 |
| (7, 8) | B | 3.16 | |
| (1, 2) | A | 5.83 | |
| (6, 7) | B | 2.00 | 2 |
| (4, 5) | A | 2.00 | 3 |
| (8, 9) | B | 4.47 | |
| (2, 2) | A | 5.00 | |
| (9, 9) | B | 5.00 |
Step 4: Assign the test instance (6, 5) the class that’s the most frequent or the majority class.
From the above table, we can see that in the neighborhood of the point (6, 5), two of the three closest instances belong to class B, whereas one instance belongs to class A. Therefore, the test instance (6, 5) is assigned class B.
A key part of k-NN is how you measure similarity. The default is often Euclidean distance (straight-line), but in many domains other metrics perform better (a short code sketch follows the list):
Manhattan (L1): sum of absolute differences; useful when features have independent contributions or grid structure.
Minkowski: a generalization with parameter $p$, bridging Euclidean ($p = 2$) and Manhattan ($p = 1$).
Cosine distance: measures angular difference; useful when magnitude is less important (e.g., text embedding vectors).
Mahalanobis distance: accounts for covariances among features; stretches space to de-emphasize redundant directions.
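Here is a rough, library-free sketch of the first three alternatives (the helper names are our own, and the inputs are assumed to be equal-length lists of numbers):

```python
import math

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=2):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction rather than magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(manhattan([2, 3], [6, 5]))        # 6
print(minkowski([2, 3], [6, 5], p=2))   # 4.47..., identical to Euclidean
print(cosine_distance([2, 3], [6, 5]))  # ~0.04: the two vectors point in similar directions
```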
Your choice of metric impacts classification: if features have varying scales or distributions, Euclidean distance may mislead. We’ll return to feature scaling after the implementation below.
The following code implements the k-NN algorithm from scratch; we only use Python’s built-in math library in this implementation example.
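Here is a minimal from-scratch version that follows the walkthrough below (the function and variable names are taken from that explanation, so the referenced line numbers should line up; treat it as a sketch of the described implementation):

```python
import math


# k-NN classification from scratch: classify a query point by the
# majority class among its k nearest neighbors in the training data.

def euclidean_distance(point1, point2):
    # Unpack the (x, y) coordinates of both points
    x1, y1 = point1
    x2, y2 = point2
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def k_nearest_neighbors(data, query_point, k):
    distances = []

    # Distance from the query point to every labeled data point
    for data_point in data:
        distance = euclidean_distance(query_point, data_point[0])
        distances.append((data_point, distance))

    # Sort in ascending order of distance (closest first)
    distances.sort(key=lambda item: item[1])

    # Keep only the k nearest neighbors
    neighbors = distances[:k]

    # Count how many neighbors belong to each class
    class_counts = {}
    for neighbor in neighbors:
        label = neighbor[0][1]
        if label in class_counts:
            class_counts[label] += 1
        else:
            class_counts[label] = 1

    # The class with the highest count is the prediction
    sorted_counts = sorted(class_counts.items(), key=lambda item: item[1], reverse=True)
    return sorted_counts[0][0]


data = [((2, 3), 'A'), ((3, 4), 'A'), ((5, 6), 'B'), ((7, 8), 'B'), ((1, 2), 'A'), ((6, 7), 'B'), ((4, 5), 'A'), ((8, 9), 'B'), ((2, 2), 'A'), ((9, 9), 'B')]
query_point = (6, 5)
k = 3
print(k_nearest_neighbors(data, query_point, k))
```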
The code is explained below:
Line 1: We start by importing the math library, which is later used to calculate square roots while computing Euclidean distances between two points.
Lines 7–11: We define the euclidean_distance function, which computes the Euclidean distance between two points.
Line 14: We start defining the k_nearest_neighbors function, which takes three arguments: data (the dataset), query_point (the point we want to classify), and k (the number of neighbors).
Line 15: We initialize an empty list called distances to store distances between query_point and the data points.
Lines 18–20: We initialize a loop to iterate through each data point in the dataset. We then call the euclidean_distance function that calculates the distance between query_point and the current data_point and store it in the distance variable. Finally, we append a tuple containing the data point and its distance to the distances list.
Line 23: We then sort the distances list in ascending order based on the distances. This step puts the nearest neighbors at the front of the list.
Line 26: Next, we select the first k entries of the distances list and store them in the neighbors list.
Lines 29–35: We start by initializing an empty dictionary called class_counts to count the occurrences of each class among the neighbors. Then, we initiate a loop to iterate through each neighbor in the neighbors list and store its class label in label. We then have an if-else condition: we first check whether the class label already exists in the dictionary. If it does, we increment its count by 1. Otherwise, we add it to our dictionary with a count of 1.
Lines 38–39: We now need to determine the majority class. For this, we sort the class_counts dictionary items in descending order, followed by returning the class label with the highest count (majority class).
Lines 42–45: Now, we test the implementation. We define the dataset, set query_point to (6, 5), and set the value of k to 3. We then call the k_nearest_neighbors function and print the result, which is B, matching our manual calculation above.
To use k-NN effectively, we must also stress feature scaling. Suppose one feature is in the range [0, 1000] and another in [0, 1]: the feature with the larger range dominates the distance. Two common remedies are listed below (a short sketch follows the list):
Min–max scaling: normalize to [0, 1]
Z-score standardization: subtract mean, divide by standard deviation
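Here is a minimal sketch of both approaches in plain Python (the sample values are made up for illustration):

```python
import statistics

values = [2.0, 150.0, 800.0, 45.0, 990.0]   # hypothetical raw feature values

# Min-max scaling: map every value into the [0, 1] range
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit standard deviation
mean = statistics.mean(values)
std = statistics.stdev(values)
z_scored = [(v - mean) / std for v in values]

print(min_max_scaled)
print(z_scored)
```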
Then weighted voting can refine majority decisions: instead of counting neighbors equally, weight by inverse distance:
$$\text{vote}(c) = \sum_{i \in \text{kNN}} \frac{1}{d(i)^{\alpha}} \, [\text{class}(i) = c]$$
Here, $\alpha$ is a smoothing exponent (commonly 1 or 2). Closer neighbors’ votes count more heavily, which is beneficial when the neighbors vary widely in proximity.
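A sketch of this weighting rule, assuming the k nearest neighbors have already been found and are given as (label, distance) pairs:

```python
def weighted_vote(neighbors, alpha=1):
    # neighbors: list of (class_label, distance) pairs for the k nearest points
    votes = {}
    for label, distance in neighbors:
        weight = 1 / (distance ** alpha + 1e-9)   # epsilon guards against a zero distance
        votes[label] = votes.get(label, 0) + weight
    return max(votes, key=votes.get)

# The three neighbors from the worked example above
print(weighted_vote([('B', 1.41), ('B', 2.0), ('A', 2.0)]))  # 'B'
```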
Let’s now take a look at a few advantages and disadvantages of using the k-NN algorithm, starting with the advantages:
The algorithm is simple to understand and implement.
Unlike other classification algorithms, there’s no training involved. Learning is instance-based.
Because all computation is deferred to prediction time (so-called lazy learning), the algorithm adapts to new data without retraining.
No assumptions are made about data distribution.
The models generated by the algorithm are easy to interpret, since predictions come directly from the stored training examples.
On the other hand, there are a few disadvantages:
The algorithm can be computationally expensive with large datasets.
The algorithm has limited ability to capture complex relationships.
The algorithm is sensitive to noisy data, irrelevant features, and the scale of the features.
Scalability can be an issue with large datasets.
Finding the best value of k is central to using k-NN in practice. Too small → noisy; too large → over-smooth. Use one of the following (a short scikit-learn sketch follows the list):
Grid search with cross-validation: test candidate k values (e.g. 1, 3, 5, 7, …) via k-fold CV.
Leave-one-out cross validation (LOOCV): useful for small datasets.
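As an example of the first approach, a grid search over candidate k values with scikit-learn might look like this (the iris dataset here is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                   # stand-in dataset

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # candidate k values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)                          # best k found by 5-fold CV
print(round(search.best_score_, 3))                 # its cross-validated accuracy
```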
For large datasets, pure k-NN is too slow (it computes the distance to every point). Common optimizations include (see the sketch after this list):
KD-tree / ball-tree: hierarchical partitioning to prune distance calculations
Approximate nearest neighbor: use locality sensitive hashing (LSH) or product quantization
Dimensionality reduction: PCA, t-SNE, or feature selection to reduce the number of input dimensions before k-NN
These techniques make k-NN usable beyond tiny toy datasets.
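As one illustration, scikit-learn’s KDTree answers nearest-neighbor queries without scanning every point; the data below is random and purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 8))           # 10,000 illustrative training points in 8 dimensions

tree = KDTree(X)                      # build the tree once up front
query = rng.random((1, 8))            # a single query point

dist, idx = tree.query(query, k=5)    # distances and indices of the 5 nearest neighbors
print(idx[0])
```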
To really understand k-NN, visualization helps. On 2D synthetic data, you can plot the decision boundary: irregular “patchwork” regions where the predicted class flips from one neighborhood to the next. Several pitfalls also become visible (a plotting sketch follows this list):
Sparse regions: in low-density areas, neighbors may come from far away, degrading reliability.
Class imbalance: the majority class may dominate neighborhoods even where minority class is relevant.
Curse of dimensionality: in many dimensions, distance becomes less discriminative; all neighbors end up looking nearly equally far away.
Visual aids help learners internalize where k-NN works well and where it fails.
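Here is a small matplotlib sketch of such a decision-boundary plot, using synthetic two-cluster data (the cluster centers and the choice of k are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data: two noisy clusters, one per class
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.8, size=(50, 2)),
               rng.normal(loc=[5, 5], scale=0.8, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Evaluate the classifier on a grid of points to reveal the decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                  # shaded class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")  # the training points
plt.title("k-NN decision boundary (k = 3)")
plt.show()
```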
This blog has provided a thorough answer to what the k-NN algorithm is. We mentioned some use cases, along with advantages and disadvantages of using the k-NN algorithm for classification. We also demonstrated our example with code that relies only on Python’s built-in math library.
Don’t stop here! You can explore and practice different techniques and libraries to build more accurate and robust models. We encourage you to check out the following courses on Educative:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
Machine Learning with Python Libraries
Machine learning helps software applications generate more accurate predictions. It is a widely used branch of artificial intelligence and offers high-paying career opportunities. This path will provide a hands-on guide to multiple Python libraries that play an important role in machine learning. This path also teaches you about neural networks, PyTorch Tensor, PyCaret, and GAN. By the end of this module, you’ll have hands-on experience in using Python libraries to automate your applications.
Mastering Machine Learning Theory and Practice
The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.