Edit- and Substring-Based Similarity
Explore edit distance and substring-based similarity functions such as Levenshtein, Jaro, and longest common substring to understand how they quantify text similarity for entity resolution tasks. Discover how to apply these methods using Python libraries and their strengths for different data types.
As humans, we recognize that “Robert Schwarz,” “Rob Shwarts,” “Bob Shvarts,” and “Schwaz, Robert” are suspiciously similar. Can we also compute scores programmatically that reflect this human perception?
Let’s explore several similarity functions for texts based on edit distances or common substrings. A third class of text similarities, based on vectorization, is out of scope for this lesson. We use the following toy dataset:
With this dataset, we illustrate the characteristics of the different similarity functions and how to use them with the recordlinkage API.
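Before turning to the library, a small hand-written sketch can make the two ideas concrete. The code below is not the recordlinkage implementation. It is a plain-Python illustration of a Levenshtein edit distance, a similarity score derived from it, and a longest common substring, applied to the four name variants from the introduction. The function names and the normalization by the longer string's length are assumptions chosen for illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to a score in [0, 1].
    Assumption: normalize by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by `a` and `b`.
    On ties, the run that ends earliest in `a` is returned."""
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i, char_a in enumerate(a, start=1):
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]


# The four name variants from the introduction.
names = ["Robert Schwarz", "Rob Shwarts", "Bob Shvarts", "Schwaz, Robert"]
reference = names[0]
for other in names[1:]:
    score = levenshtein_similarity(reference, other)
    shared = longest_common_substring(reference, other)
    print(f"{reference!r} vs {other!r}: similarity={score:.2f}, shared={shared!r}")
```

Note how the two measures can disagree: “Schwaz, Robert” shares the long substring “Robert” with the reference, yet its swapped word order costs many single-character edits. In recordlinkage, comparable measures are selected through the `method` argument of `Compare.string`, for example `method="levenshtein"` or `method="lcs"`; the scores it returns may be normalized differently from this sketch.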
Overview of string similarity functions
The large variety of string similarity functions can be overwhelming. Below, we briefly introduce a shortlist:
Levenshtein: Levenshtein counts the number of ...