Trusted answers to developer questions

The Levenshtein distance algorithm

Get Started With Data Science

Learn the fundamentals of Data Science with this free course. Future-proof your career by adding Data Science skills to your toolkit — or prepare to land a job in AI, Machine Learning, or Data Analysis.

widget

The Levenshtein distance (a.k.a edit distance) is a measure of similarity between two strings. It is defined as the minimum number of changes required to convert string a into string b (this is done by inserting, deleting or replacing a character in string a). The smaller the Levenshtein distance, the more similar the strings are. This is a very common problem in the application of Dynamic Programming.

Explanation

Using sub-problems

A pre-requisite for applying Dynamic Programming, to solve a problem, is to demonstrate that the solution to the original problem can be found by using the solutions to the sub-problems.

The approach here is somewhat simple and intuitive. Consider the strings a and b up to their last character:

  1. If the last characters of both strings are the same, then the edit distance is equal to the edit distance of the same two strings, up to their second-to-last character.
  2. If the last character is different, then the edit distance is equal to the minimum of the cost of inserting, deleting, or replacing the last character of string a.

Understanding the array

Since DP utilises memory for storing the solutions to sub-problems, we will use a 2D array (D) for this purpose.

The value in D[i][j] represents the edit distance if we consider the strings a and b, till their ith and jth character, respectively. Therefore, the answer to the original problem will be found in D[length of 'a'][length of 'b'].

How sub-problems are used in Levenshtein's algorithm
How sub-problems are used in Levenshtein's algorithm

Code

#include <iostream>
#include <cstring>
#include <algorithm>
using namespace std;
int findMin(int x, int y, int z){
if(x <= y && x <= z)
return x;
else if(y <=x && y <= z)
return y;
else
return z;
}
void freeMemory(int** arr, int a, int b){
for(int i = 0; i < a; i++){
delete[] arr[i];
}
delete[] arr;
}
int findDistance(char a[], char b[]){
// Declaring a 2D array on the heap dynamically:
int len_a = strlen(a);
int len_b = strlen(b);
int** d = new int*[len_a + 1];
for(int i = 0; i < len_a + 1; i++)
d[i] = new int[len_b + 1];
// Initialising first column:
for(int i = 0; i < len_a + 1; i++)
d[i][0] = i;
// Initialising first row:
for(int j = 0; j < len_b + 1; j++)
d[0][j] = j;
// Applying the algorithm:
int insertion, deletion, replacement;
for(int i = 1; i < len_a + 1; i++){
for(int j = 1; j < len_b + 1; j++){
if(a[i - 1] == b[j - 1]){
d[i][j] = d[i - 1][j - 1];
}
else{
// Choosing the best option:
insertion = d[i][j - 1];
deletion = d[i - 1][j];
replacement = d[i - 1][j - 1];
d[i][j] = 1 + findMin(insertion, deletion, replacement);
}
}
}
int answer = d[len_a][len_b];
freeMemory(d, len_a + 1, len_b + 1);
return answer;
}
int main() {
char a[] = "Carry";
char b[] = "Bark";
cout << "Levenshtein Distance: " << findDistance(a, b);
return 0;
}

RELATED TAGS

Copyright ©2024 Educative, Inc. All rights reserved
Did you find this helpful?