Edit Distance

Get to know the edit distance problem and its solution using dynamic programming.

We'll cover the following

Edit distance and its applications

The edit distance between two strings is the minimum number of letter insertions, letter deletions, and letter substitutions required to transform one string into the other. For example, the edit distance between ${\color{Red} FOOD}$ and ${\color{Red} MONEY}$ is at most four:

${\color{Red} \underset{-}{F}OOD } \to {\color{Red}MO\underset{-}{O}D} \to {\color{Red} MON\underset{\wedge}{}D} \to {\color{Red} MONE\underset{-}D} \to {\color{Red} MONEY}$

This distance function was independently proposed by Vladimir Levenshtein in 1965 (working on coding theory), Taras Vintsyuk in 1968 (working on speech recognition), and Stanislaw Ulam in 1972 (working with biological sequences). For this reason, edit distance is sometimes called Levenshtein distance or Ulam distance (but strangely, never “Vintsyuk distance”).

We can visualize this editing process by aligning the strings one above the other, with a gap in the first word for each insertion and a gap in the second word for each deletion. Columns with two different characters correspond to substitutions. In this representation, the number of editing steps is just the number of columns that do not contain the same character twice.

$\hspace{6cm}{\color{Red} F\hspace{0.3cm} O \hspace{0.3cm}O \hspace{0.87cm} D} \\ \hspace{5.9cm}{\color{Red} M\hspace{0.275cm} O \hspace{0.27cm} N \hspace{0.3cm} \underset{.}E \hspace{0.3cm} Y}$

It’s fairly obvious that we can’t transform ${\color{Red} FOOD}$ into ${\color{Red} MONEY}$ in three steps, so the edit distance between ${\color{Red} FOOD}$ and ${\color{Red} MONEY}$ is exactly four. Unfortunately, it’s not so easy in general to tell when a sequence of edits is as short as possible. For example, the following alignment shows that the distance between the strings ${\color{Red} ALGORITHM}$ and ${\color{Red} ALTRUISTIC}$ is at most 6. Is that the best we can do?

$\hspace{5.94cm}{\color{Red} A\hspace{0.3cm} L \hspace{0.3cm}G \hspace{0.3cm} O \hspace{0.3cm} R \hspace{0.9cm} I \hspace{0.84cm} T \hspace{0.3cm} H \hspace{0.3cm} M} \\ \hspace{5.9cm}{\color{Red} A\hspace{0.275cm} L \hspace{0.94cm} T \hspace{0.3cm} R\hspace{0.3cm} U \hspace{0.3cm} I \hspace{0.3cm} S \hspace{0.3cm} T \hspace{0.35cm} I \hspace{0.45cm} C}$

Recursive structure

To develop a dynamic programming algorithm to compute edit distance, we first need to formulate the problem recursively. Our alignment representation for edit sequences has a crucial “optimal substructure” property. Suppose we have the gap representation for the shortest edit sequence for two strings. If we remove the last column, the remaining columns must represent the shortest edit sequence for the remaining prefixes. We can easily prove this observation by contradiction: If the prefixes had a shorter edit sequence, gluing the last column back on would give us a shorter edit sequence for the original strings. So once we figure out what should happen in the last column, the Recursion Fairy can figure out the rest of the optimal gap representation.

Said differently, the alignment we are looking for represents a sequence of editing operations, ordered (for no particular reason) from right to left. Solving the edit distance problem requires making a sequence of decisions, one for each column in the output alignment. In the middle of this sequence of decisions, we have already aligned a suffix of one string with a suffix of the other.