What is the Smith-Waterman similarity measure in NLP?
The Smith-Waterman similarity measure is used in NLP to find the local similarity between two strings. It’s commonly used in bioinformatics to find similar gene sequences, but has also found usage in NLP for tasks like multiworded keyword search and finding similarity in erroneous strings, where conventional approaches like bag-of-words (BOW) might not be effective.
Smith-Waterman algorithm
The Smith-Waterman algorithm is a dynamic programming algorithm employed to calculate the Smith-Waterman similarity measure. Its utility is in its flexibility to penalize differences in the strings being compared.
For two strings
Define the similarity score of the algorithm as
, where and are two string characters. Define the gap penalty of the strings by an array
, where is the length of the gap. By assigning a negative score of our choice to the gaps in the sequences, we can control how the gaps influence the similarity score. Create a zero matrix
of size . Fill the entries of
using the following equation:
Here,
The greatest element of
is the Smith-Waterman similarity measure.
The time complexity of the Smith-Waterman algorithm of two strings is
Code example
Let’s explore the implementation of the Smith-Waterman similarity measure using the py_stringmatching library.
from py_stringmatching.similarity_measure.smith_waterman import SmithWaterman
str1 = "Alexis"
str2 = "AliBaba"
sw = SmithWaterman(gap_cost=1.5)
smith_waterman_similarity = sw.get_raw_score(str1, str2)
print("Similarity function: ", sw.get_sim_func())
print("Smith-Waterman Similarity of \"{}\" and \"{}\" = {}".format(str1, str2, smith_waterman_similarity))Explanation
Line 1: We import the Smith-Waterman similarity measure class from the
py_stringmatchinglibrary.Line 5: We instantiate an
swobject of the Smith-Waterman similarity measure class.Line 7: We then measure the Smith-Waterman similarity of the
str1andstr2strings.Lines 9–10: The similarity function and the Smith-Waterman similarity of the strings are printed.
Free Resources