The Essential k-NN Algorithm
Explore the essential k-Nearest Neighbors algorithm, including steps to classify data points and how to optimize performance by minimizing sorting overhead with Python's bisect and heapq modules. Understand computational costs and benchmark different approaches to enhance algorithm efficiency.
We'll cover the following...
Overview
We can summarize the -NN algorithm as having the following steps:
- Create a list of all (distance, training sample) pairs.
- Sort these in ascending order.
- Pick to the first , which will be the k nearest neighbors.
- Chose the mode (the highest frequency) label for the nearest neighbors.
The implementation would look like this:
While clear, this does accumulate a large number of distance values in the distances list object, when only are actually needed. The sorted() function consumes the source generator and creates a (potentially large) list of intermediate values.
One of the high-cost parts of this specific -NN algorithm is sorting the entire set of training data after the distances have been computed. We summarize the complexity with the description as an operation. A way to avoid cost is to avoid sorting the entire set of distance computations.
Steps ...