
The Essential k-NN Algorithm

Explore the essential k-Nearest Neighbors algorithm, including steps to classify data points and how to optimize performance by minimizing sorting overhead with Python's bisect and heapq modules. Understand computational costs and benchmark different approaches to enhance algorithm efficiency.

Overview

We can summarize the k-NN algorithm as having the following steps:

  1. Create a list of all (distance, training sample) pairs.
  2. Sort these in ascending order.
  3. Pick the first k; these are the k nearest neighbors.
  4. Choose the mode (the highest-frequency) label for the k nearest neighbors.

The implementation would look like this:

Python 3.10.4
import collections
from collections import Counter
from typing import NamedTuple

# TrainingKnownSample, DistanceFunc, TrainingList, and AnySample
# are defined earlier in the case study.

class Measured(NamedTuple):
    distance: float
    sample: TrainingKnownSample

def k_nn_1(
    k: int, dist: DistanceFunc, training_data: TrainingList,
    unknown: AnySample
) -> str:
    # Compute a distance for every training sample, then sort them all.
    distances = sorted(
        map(lambda t: Measured(dist(t, unknown), t), training_data)
    )
    # The first k entries are the k nearest neighbors.
    k_nearest = distances[:k]
    # The classification is the mode (most common) species among them.
    k_frequencies: Counter[str] = collections.Counter(
        s.sample.sample.species for s in k_nearest
    )
    mode, fq = k_frequencies.most_common(1)[0]
    return mode
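
The names TrainingKnownSample, DistanceFunc, TrainingList, and AnySample come from earlier in the case study. To experiment with the function on its own, minimal stand-ins such as the following could be defined before the code above (the annotations are evaluated when the classes and function are created). These particular definitions, the euclidean helper, and the sample data are illustrative assumptions, not the case study's actual classes:

from typing import Callable, NamedTuple

class KnownSample(NamedTuple):
    # Hypothetical stand-in; the case study's sample classes differ.
    features: tuple[float, ...]
    species: str

class TrainingKnownSample(NamedTuple):
    sample: KnownSample

TrainingList = list[TrainingKnownSample]
AnySample = tuple[float, ...]
DistanceFunc = Callable[[TrainingKnownSample, AnySample], float]

def euclidean(t: TrainingKnownSample, unknown: AnySample) -> float:
    # Euclidean distance between the known sample's features and the unknown.
    return sum(
        (a - b) ** 2 for a, b in zip(t.sample.features, unknown)
    ) ** 0.5

training = [
    TrainingKnownSample(KnownSample((5.1, 3.5), "setosa")),
    TrainingKnownSample(KnownSample((6.3, 2.9), "virginica")),
    TrainingKnownSample(KnownSample((5.0, 3.4), "setosa")),
]
print(k_nn_1(2, euclidean, training, (5.0, 3.3)))  # setosa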

While clear, this implementation accumulates a large number of distance values in the distances list when only k are actually needed. The sorted() function consumes the source generator and creates a (potentially large) list of intermediate values.
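
One way to bound the memory footprint, using the bisect module mentioned in the overview, is to keep only a sorted window of the k best candidates seen so far. This is a sketch of that idea reusing the names from the code above, not necessarily the chapter's exact implementation; the key= argument to bisect.insort requires Python 3.10 or later:

import bisect

def k_nn_bisect(
    k: int, dist: DistanceFunc, training_data: TrainingList,
    unknown: AnySample
) -> str:
    # Maintain a sorted window of at most k (distance, sample) pairs.
    window: list[Measured] = []
    for t in training_data:
        m = Measured(dist(t, unknown), t)
        # key= keeps ordering by distance and avoids comparing
        # sample objects when two distances tie.
        bisect.insort(window, m, key=lambda meas: meas.distance)
        if len(window) > k:
            window.pop()  # discard the current (k+1)-th nearest
    k_frequencies: Counter[str] = collections.Counter(
        s.sample.sample.species for s in window
    )
    mode, fq = k_frequencies.most_common(1)[0]
    return mode

The window never holds more than k + 1 items, so memory stays bounded no matter how large the training set is.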

One of the high-cost parts of this specific k-NN algorithm is sorting the entire set of training data after the distances have been computed. We summarize the complexity of this sort as an O(n log n) operation. A way to avoid this cost is to avoid sorting the entire set of distance computations.
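
The heapq module, also mentioned in the overview, offers one such route: heapq.nsmallest consumes the distance generator while keeping a bounded heap of the k smallest entries, which is roughly O(n log k) rather than O(n log n). A sketch, again reusing the names from the code above rather than reproducing the chapter's exact implementation:

import heapq

def k_nn_heapq(
    k: int, dist: DistanceFunc, training_data: TrainingList,
    unknown: AnySample
) -> str:
    # nsmallest keeps at most k items on a heap while consuming the
    # generator, so the full list of n distances is never materialized.
    k_nearest = heapq.nsmallest(
        k,
        (Measured(dist(t, unknown), t) for t in training_data),
        key=lambda m: m.distance,
    )
    k_frequencies: Counter[str] = collections.Counter(
        s.sample.sample.species for s in k_nearest
    )
    mode, fq = k_frequencies.most_common(1)[0]
    return mode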

Steps ...