Trusted answers to developer questions
Trusted Answers to Developer Questions

Related Tags

clustering
hashing algorithm

What is Locality Sensitive Hashing (LSH)?

Ateeq Ur Rehman Baig

Grokking Modern System Design Interview for Engineers & Managers

Ace your System Design Interview and take your career to the next level. Learn to handle the design of applications like Netflix, Quora, Facebook, Uber, and many more in a 45-min interview. Learn the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process.

Answers Code

Introduction

Locality Sensitive Hashing (LSH) is a technique widely applicable to the approximate similarity search. It’s used because comparing billions of data points in current-day searches is not practically feasible. Therefore, we need a holistic method for dimensionality reduction to limit our search scope to highly relatable data points only.

LSH is an algorithmic approach to clustering similar input points into the same buckets using their hash values.

Contrary to the other hashing techniques, LSH aims to maximize the collisions of similar input points, ensuring a high probability of placing them in a single bucket.

LSH's comparison with other hash functions

Technique

There are different versions of LSH depending on the variation of hashing functions and similarity metrics. The two prominent versions are as follows:

  1. The traditional approach: Shingling, MinHashing, and Banded LSH.
  2. A simpler approach: Random hyperplanes through a unit sphere containing the input points, bit vectors, and Hamming distances.

Here, we explore the hyperplane approach in more detail. Let’s go through its algorithm.

Algorithm

  1. Generate a random hyperplane passing through the unit sphere containing the input points.
  2. Assign a binary bit value to all the input points based on their placements in the dichotomies produced by the passing hyperplane. We assign a 0 (negative) value to all the points in the upward portion of the hyperplane and a 1 (positive) value to the points residing in the downward portion.

Note: Practically, we take the dot product of the hyperplane’s normal vector and the input vector and assign values based on the result: 1 if they both are in the same direction and 0 if they are in the opposite direction.

  1. Repeat step 2 a fixed number of times with varying placements of hyperplanes. This allows us to generate hashed vectors with lower dimensionality (equal to the number of hyperplanes). In short, it generates varying bucket configurations of input vectors for various iterations and helps us improve their final bucket allocation.

  2. Once we have our index of hashed vectors for input points, we can find the similarity index of any query vector by calculating its Hamming distance with each one of the index vectors. The higher the Hamming distance, the lower the similarity, and vice versa.

Input points in the sphere
1 of 4

Based on the above illustration, the hashed vectors for three input points come out to be 101, 101, and 001, respectively.

Let’s assume there comes a query point with the hash value of 111. We can calculate the Hamming distanceNumber of mismatches between the vectors’ individual bits of the incoming query point with all the indexed vectors and find the closest match.

The following are the Hamming distances of all the comparisons:

  1. H(101, 111) = 1
  2. H(101, 111) = 1
  3. H(001, 111) = 2

Since we take the lowest Hamming number as an indicator of the closest match, points 1 and 2 are approximately similar to the query point.

Computational cost

Let’s assume we have N-dimensional P points as inputs, and we want to generate a total of K hyperplanes, approximately equal to the natural logarithm of P (logPlog P).

  • We can expect the final placement of a point in a bucket in the time of N × KN\ \times\ K.
  • The total expected collisions would be P2K\frac{P}{2^{K}}, making the total comparison cost for all the dimensions N × P2K\frac{N\ \times\ P}{2^{K}}.
  • The total computational cost for the locality sensitive hashing comes out to be O(logP)O(log P) where KlogPK \sim log P.

RELATED TAGS

clustering
hashing algorithm

CONTRIBUTOR

Ateeq Ur Rehman Baig
Copyright ©2022 Educative, Inc. All rights reserved

Grokking Modern System Design Interview for Engineers & Managers

Ace your System Design Interview and take your career to the next level. Learn to handle the design of applications like Netflix, Quora, Facebook, Uber, and many more in a 45-min interview. Learn the RESHADED framework for architecting web-scale applications by determining requirements, constraints, and assumptions before diving into a step-by-step design process.

Answers Code
Keep Exploring