Search⌘ K
AI Features

Edit Distance

Explore the concept of edit distance, which quantifies the minimum edits needed to transform one string into another. Learn to implement a dynamic programming solution in Python that efficiently calculates this distance by considering insertions, deletions, and substitutions.

Edit distance and its applications

The edit distance between two strings is the minimum number of letter insertions, letter deletions, and letter substitutions required to transform one string into the other. For example, the edit distance between FOOD{\color{Red} FOOD} and MONEY{\color{Red} MONEY} is at most four:

FOODMOODMONDMONEDMONEY{\color{Red} \underset{-}{F}OOD } \to {\color{Red}MO\underset{-}{O}D} \to {\color{Red} MON\underset{\wedge}{}D} \to {\color{Red} MONE\underset{-}D} \to {\color{Red} MONEY}

This distance function was independently proposed by Vladimir Levenshtein in 1965 (working on coding theory), Taras Vintsyuk in 1968 (working on speech recognition), and Stanislaw Ulam in 1972 (working with biological sequences). For this reason, edit distance is sometimes called Levenshtein distance or Ulam distance (but strangely, never “Vintsyuk distance”).

We can visualize this editing process by aligning the strings one above the other, with a gap in the first word for each insertion and a gap in the second word for each deletion. Columns with two different characters correspond to substitutions. In this representation, the number of editing steps is just the number of columns that do not contain the same character twice.

FOODMONE.Y\hspace{6cm}{\color{Red} F\hspace{0.3cm} O \hspace{0.3cm}O \hspace{0.87cm} D} \\ \hspace{5.9cm}{\color{Red} M\hspace{0.275cm} O \hspace{0.27cm} N \hspace{0.3cm} \underset{.}E \hspace{0.3cm} Y}

It’s fairly obvious that we can’t transform FOOD{\color{Red} FOOD} into MONEY{\color{Red} MONEY} in three steps, so the edit distance between FOOD{\color{Red} FOOD} and MONEY{\color{Red} MONEY} is exactly four. Unfortunately, it’s not so easy in general to tell when a sequence of edits is as short as possible. For example, the following alignment shows that the distance between the strings ALGORITHM{\color{Red} ALGORITHM} and ALTRUISTIC{\color{Red} ALTRUISTIC} is at most 6. Is that the best we can do?

ALGORITHMALTRUISTIC\hspace{5.94cm}{\color{Red} A\hspace{0.3cm} L \hspace{0.3cm}G \hspace{0.3cm} O \hspace{0.3cm} R \hspace{0.9cm} I \hspace{0.84cm} T \hspace{0.3cm} H \hspace{0.3cm} M} \\ \hspace{5.9cm}{\color{Red} A\hspace{0.275cm} L \hspace{0.94cm} T \hspace{0.3cm} R\hspace{0.3cm} U \hspace{0.3cm} I \hspace{0.3cm} S \hspace{0.3cm} T \hspace{0.35cm} I \hspace{0.45cm} C} ...