You will probably have heard all the buzz about machine learning and its applications. And if you have, you’ve probably heard about the k-nearest neighbors (k-NN) algorithm and some of its common use cases:
Classification problems: The algorithm assigns a new data point the majority class of its nearest neighbors.
Recommender systems: The algorithm can suggest items by finding the users or items most similar to a given one.
Regression analysis: The algorithm can predict a numerical value by averaging the values of the nearest neighbors.
Healthcare and medicine: The algorithm can support tasks such as classifying patients or medical images based on similarity to past cases.
In this blog, our focus will be only on classification problems. We’ll take a look at a numerical example, running the k-NN algorithm step by step, and then implement it in Python from scratch.
Now, let’s explore the basics of the k-NN algorithm.
What is the k-NN algorithm?
Let’s take a look at the steps involved in classifying a new data point with k-NN.
Step 1: To classify a test instance, define the value of k, i.e., the number of nearest neighbors to consider.
How to define the value of k:
The value of k is usually an odd number to avoid a 50% split in a binary classification problem. A low value of k is more error-prone (or might not give accurate results). In practice, the value of k is usually set to around sqrt(n), where n is the number of instances in the dataset (a short code sketch of this rule follows the steps below).
Step 2: Calculate the similarity (often using Euclidean distance or another distance metric) between the new item and all data points in the training dataset. Based on the similarity values, extract the k nearest neighbors, i.e., the k most similar data points.
Step 3: Assuming multiple classes (e.g., A and B), count how many of the k nearest neighbors belong to each class.
Step 4: Assign the test instance the class that is most frequent among its k nearest neighbors (the majority class).
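As a quick illustration of the rule of thumb for choosing k from step 1, here is a minimal sketch (the value of n below is made up for illustration):

```python
import math

n = 1000                   # assumed number of training instances
k = round(math.sqrt(n))    # rule-of-thumb starting point: sqrt(n)
if k % 2 == 0:             # prefer an odd k to avoid ties in binary classification
    k += 1
print(k)                   # 33 for n = 1000
```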
Assume that we have the following dataset:
| Data point | Class |
| --- | --- |
| (2, 3) | A |
| (3, 4) | A |
| (5, 6) | B |
| (7, 8) | B |
| (1, 2) | A |
| (6, 7) | B |
| (4, 5) | A |
| (8, 9) | B |
| (2, 2) | A |
| (9, 9) | B |
We also have the following test data point: (6, 5)
Step 1: To classify the test instance (6, 5), define the value of k.
As mentioned above, the value of k is usually set to around sqrt(n), where n is the number of training instances. In our example dataset, n is 10 and sqrt(10) ≈ 3.16, so k is set to 3.
Step 2: We calculate the similarity between the new instance and all data points in the training dataset.
Note: We’ll be using the Euclidean distance formula in our calculations.
Recall the formula for Euclidean distance:
$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Here, $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the two points. For example, the distance from (2, 3) to our test point (6, 5) is $\sqrt{(6 - 2)^2 + (5 - 3)^2} = \sqrt{20} \approx 4.47$.
After calculating distances, here is a table of distances from each point to our test point, i.e., (6, 5):
| Data point | Distance from (6, 5) |
| --- | --- |
| (2, 3) | 4.47 |
| (3, 4) | 3.16 |
| (5, 6) | 1.41 |
| (7, 8) | 3.16 |
| (1, 2) | 5.83 |
| (6, 7) | 2.00 |
| (4, 5) | 2.00 |
| (8, 9) | 4.47 |
| (2, 2) | 5.00 |
| (9, 9) | 5.00 |
Step 3: Count the number of data points belonging to each class among the k = 3 nearest neighbors. The table below ranks the three closest points:
| Data point | Class | Distance from (6, 5) | Rank |
| --- | --- | --- | --- |
| (2, 3) | A | 4.47 | |
| (3, 4) | A | 3.16 | |
| (5, 6) | B | 1.41 | 1 |
| (7, 8) | B | 3.16 | |
| (1, 2) | A | 5.83 | |
| (6, 7) | B | 2.00 | 2 |
| (4, 5) | A | 2.00 | 3 |
| (8, 9) | B | 4.47 | |
| (2, 2) | A | 5.00 | |
| (9, 9) | B | 5.00 |
Step 4: Assign the test instance (6, 5) the class that’s the most frequent or the majority class.
From the above table, we can see that in the neighborhood of the point (6, 5), two of the three closest instances belong to class B, whereas one instance belongs to class A. Therefore, the test instance (6, 5) is assigned class B.
A key part of k-NN is how you measure similarity. The default is often Euclidean distance (straight-line), but in many domains other metrics perform better (a short code sketch follows the list):
Manhattan (L1): sum of absolute differences; useful when features have independent contributions or grid structure.
Minkowski: a generalization with parameter $p$, bridging Euclidean ($p = 2$) and Manhattan ($p = 1$).
Cosine distance: measures angular difference; useful when magnitude is less important (e.g., text embedding vectors).
Mahalanobis distance: accounts for covariances among features; stretches space to de-emphasize redundant directions.
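Here is a rough, library-free sketch of the first three alternatives (the helper names are our own, and the inputs are assumed to be equal-length lists of numbers):

```python
import math

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=2):
    # Generalizes Manhattan (p = 1) and Euclidean (p = 2)
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_distance(a, b):
    # 1 - cosine similarity: compares direction rather than magnitude
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

print(manhattan([2, 3], [6, 5]))        # 6
print(minkowski([2, 3], [6, 5], p=2))   # 4.47..., identical to Euclidean
print(cosine_distance([2, 3], [6, 5]))  # ~0.04: the two vectors point in similar directions
```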
Your choice of metric impacts classification: if features have varying scales or distributions, Euclidean distance may mislead. We’ll return to feature scaling after the implementation below.
The following code implements the k-NN algorithm from scratch; we only use Python’s built-in math library in this implementation example.
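Here is a minimal from-scratch version that follows the walkthrough below (the function and variable names are taken from that explanation, so the referenced line numbers should line up; treat it as a sketch of the described implementation):

```python
import math


# k-NN classification from scratch: classify a query point by the
# majority class among its k nearest neighbors in the training data.

def euclidean_distance(point1, point2):
    # Unpack the (x, y) coordinates of both points
    x1, y1 = point1
    x2, y2 = point2
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


def k_nearest_neighbors(data, query_point, k):
    distances = []

    # Distance from the query point to every labeled data point
    for data_point in data:
        distance = euclidean_distance(query_point, data_point[0])
        distances.append((data_point, distance))

    # Sort in ascending order of distance (closest first)
    distances.sort(key=lambda item: item[1])

    # Keep only the k nearest neighbors
    neighbors = distances[:k]

    # Count how many neighbors belong to each class
    class_counts = {}
    for neighbor in neighbors:
        label = neighbor[0][1]
        if label in class_counts:
            class_counts[label] += 1
        else:
            class_counts[label] = 1

    # The class with the highest count is the prediction
    sorted_counts = sorted(class_counts.items(), key=lambda item: item[1], reverse=True)
    return sorted_counts[0][0]


data = [((2, 3), 'A'), ((3, 4), 'A'), ((5, 6), 'B'), ((7, 8), 'B'), ((1, 2), 'A'), ((6, 7), 'B'), ((4, 5), 'A'), ((8, 9), 'B'), ((2, 2), 'A'), ((9, 9), 'B')]
query_point = (6, 5)
k = 3
print(k_nearest_neighbors(data, query_point, k))
```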
The code is explained below:
Line 1: We start by importing the math library, which is later used to calculate square roots while computing Euclidean distances between two points.
Lines 7–11: We define the euclidean_distance function, which computes the Euclidean distance between two points.
Line 14: We start defining the k_nearest_neighbors function, which takes three arguments: data (the dataset), query_point (the point we want to classify), and k (the number of neighbors).
Line 15: We initialize an empty list called distances to store distances between query_point and the data points.
Lines 18–20: We initialize a loop to iterate through each data point in the dataset. We then call the euclidean_distance function that calculates the distance between query_point and the current data_point and store it in the distance variable. Finally, we append a tuple containing the data point and its distance to the distances list.
Line 23: We then sort the distances list in ascending order based on the distances. This step puts the nearest neighbors at the front of the list.
Line 26: Next, we select the first k entries of the distances list and store them in the neighbors list.
Lines 29–35: We start by initializing an empty dictionary called class_counts to count the occurrences of each class among the neighbors. Then, we initiate a loop to iterate through each neighbor in the neighbors list and store its class label in label. We then have an if-else condition: we first check whether the class label already exists in the dictionary. If it does, we increment its count by 1. Otherwise, we add it to our dictionary with a count of 1.
Lines 38–39: We now need to determine the majority class. For this, we sort the class_counts dictionary items in descending order, followed by returning the class label with the highest count (majority class).
Lines 42–45: Now, we test the implementation. We define the dataset, set query_point to (6, 5), and set the value of k to 3. We then call the k_nearest_neighbors function and print the result, which is B, matching our manual calculation above.
To use k-NN effectively, we must also stress feature scaling. Suppose one feature is in the range [0, 1000] and another in [0, 1]: the feature with the larger range dominates the distance. Two common remedies are listed below (a short sketch follows the list):
Min–max scaling: normalize to [0, 1]
Z-score standardization: subtract mean, divide by standard deviation
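Here is a minimal sketch of both approaches in plain Python (the sample values are made up for illustration):

```python
import statistics

values = [2.0, 150.0, 800.0, 45.0, 990.0]   # hypothetical raw feature values

# Min-max scaling: map every value into the [0, 1] range
lo, hi = min(values), max(values)
min_max_scaled = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean, unit standard deviation
mean = statistics.mean(values)
std = statistics.stdev(values)
z_scored = [(v - mean) / std for v in values]

print(min_max_scaled)
print(z_scored)
```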
Then weighted voting can refine majority decisions: instead of counting neighbors equally, weight by inverse distance:
$$\text{vote}(c) = \sum_{i \in \text{kNN}} \frac{1}{d(i)^{\alpha}} \, [\text{class}(i) = c]$$
Here, $\alpha$ is a smoothing exponent (commonly 1 or 2). Closer neighbors’ votes count more heavily, which is beneficial when the neighbors vary widely in proximity.
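A sketch of this weighting rule, assuming the k nearest neighbors have already been found and are given as (label, distance) pairs:

```python
def weighted_vote(neighbors, alpha=1):
    # neighbors: list of (class_label, distance) pairs for the k nearest points
    votes = {}
    for label, distance in neighbors:
        weight = 1 / (distance ** alpha + 1e-9)   # epsilon guards against a zero distance
        votes[label] = votes.get(label, 0) + weight
    return max(votes, key=votes.get)

# The three neighbors from the worked example above
print(weighted_vote([('B', 1.41), ('B', 2.0), ('A', 2.0)]))  # 'B'
```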
Let’s now take a look at a few advantages and disadvantages of using the k-NN algorithm, starting with the advantages:
The algorithm is simple to understand and implement.
Unlike other classification algorithms, there’s no training involved. Learning is instance-based.
Because all computation is deferred to prediction time (so-called lazy learning), the algorithm adapts to new data without retraining.
No assumptions are made about data distribution.
The models generated by the algorithm are easy to interpret, since predictions come directly from the stored training examples.
On the other hand, there are a few disadvantages:
The algorithm can be computationally expensive with large datasets.
The algorithm has limited ability to capture complex relationships.
The algorithm is sensitive to noisy data, irrelevant features, and the scale of the features.
Scalability can be an issue with large datasets.
Finding the best value of k is central to using k-NN in practice. Too small → noisy; too large → over-smooth. Use one of the following (a short scikit-learn sketch follows the list):
Grid search with cross-validation: test candidate k values (e.g. 1, 3, 5, 7, …) via k-fold CV.
Leave-one-out cross validation (LOOCV): useful for small datasets.
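As an example of the first approach, a grid search over candidate k values with scikit-learn might look like this (the iris dataset here is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                   # stand-in dataset

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}   # candidate k values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)                          # best k found by 5-fold CV
print(round(search.best_score_, 3))                 # its cross-validated accuracy
```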
For large datasets, pure k-NN is too slow (it computes the distance to every point). Common optimizations include (see the sketch after this list):
KD-tree / ball-tree: hierarchical partitioning to prune distance calculations
Approximate nearest neighbor: use locality sensitive hashing (LSH) or product quantization
Dimensionality reduction: PCA, t-SNE, or feature selection to reduce the number of input dimensions before k-NN
These techniques make k-NN usable beyond tiny toy datasets.
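As one illustration, scikit-learn’s KDTree answers nearest-neighbor queries without scanning every point; the data below is random and purely illustrative:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 8))           # 10,000 illustrative training points in 8 dimensions

tree = KDTree(X)                      # build the tree once up front
query = rng.random((1, 8))            # a single query point

dist, idx = tree.query(query, k=5)    # distances and indices of the 5 nearest neighbors
print(idx[0])
```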
To really understand k-NN, visualization helps. On 2D synthetic data, you can plot the decision boundary: irregular “patchwork” regions where the predicted class flips from one neighborhood to the next. Several pitfalls also become visible (a plotting sketch follows this list):
Sparse regions: in low-density areas, neighbors may come from far away, degrading reliability.
Class imbalance: the majority class may dominate neighborhoods even where minority class is relevant.
Curse of dimensionality: in many dimensions, distance becomes less discriminative; all neighbors end up looking nearly equally far away.
Visual aids help learners internalize where k-NN works well and where it fails.
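Here is a small matplotlib sketch of such a decision-boundary plot, using synthetic two-cluster data (the cluster centers and the choice of k are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2-D data: two noisy clusters, one per class
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.8, size=(50, 2)),
               rng.normal(loc=[5, 5], scale=0.8, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Evaluate the classifier on a grid of points to reveal the decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                  # shaded class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")  # the training points
plt.title("k-NN decision boundary (k = 3)")
plt.show()
```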
This blog has provided a thorough answer to what the k-NN algorithm is. We mentioned some use cases, along with advantages and disadvantages of using the k-NN algorithm for classification. We also demonstrated our example with code that relies only on Python’s built-in math library.
Don’t stop here! You can explore and practice different techniques and libraries to build more accurate and robust models. We encourage you to check out the following courses on Educative:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
Machine Learning with Python Libraries
Machine learning helps software applications generate more accurate predictions. It is a widely used branch of artificial intelligence and offers high-paying career opportunities. This path will provide a hands-on guide to multiple Python libraries that play an important role in machine learning. This path also teaches you about neural networks, PyTorch Tensor, PyCaret, and GAN. By the end of this module, you’ll have hands-on experience in using Python libraries to automate your applications.
Mastering Machine Learning Theory and Practice
The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.