Similarity Features
Explore how to apply similarity functions in Python for entity resolution using the RecordLinkage package. Understand indexing, configuring similarity scores for attributes, and evaluating match thresholds to improve duplicate detection in data records.
RecordLinkage works in two main steps:
- Indexing: Select which pairs of records are duplicate candidates and therefore should be compared.
- Scoring: Configure and compute a vector of similarity functions for every pair in the index.
All-in indexing
We keep it simple here and add every possible pair to the index—a “full” index in the RecordLinkage terminology.
Every element in the index is a pair of customer_id values. The recordlinkage API warns us against using a full index, which can get very expensive computationally. That's not something we need to worry about here because the data is small. The size of the full index is a simple function of the sample size n: it contains n(n − 1)/2 pairs.
That’s roughly 373k pairs, which we will process in just a few seconds.
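The growth of a full index can be checked with a small sketch in plain Python. It does not use the recordlinkage package: `itertools.combinations` stands in for the full index, and the `customer_ids` values are made up for illustration.

```python
from itertools import combinations


def full_index(ids):
    """Return every unordered pair of distinct ids (a stand-in for a full index)."""
    return list(combinations(ids, 2))


def full_index_size(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical customer_id values, not taken from the lesson's data
customer_ids = [101, 102, 103, 104]
pairs = full_index(customer_ids)

# The enumerated pairs agree with the closed-form count
assert len(pairs) == full_index_size(len(customer_ids))
print(pairs)
```

Because the pair count grows quadratically, doubling the number of records roughly quadruples the number of comparisons, which is why a full index only suits small samples.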
Measuring similarity
Our data below contains seven preprocessed attributes: clean and phonetic versions of the original data's customer names, cities, and streets, plus a clean version of the phone numbers. We configure one similarity function per attribute.
The recordlinkage API has several built-in similarity functions. We have good reasons to choose different methods for different attributes.
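The idea of pairing each attribute with its own similarity function can be sketched without the package. In this minimal sketch, the attribute names, the sample records, and the choice of method per attribute are all assumptions for illustration; `difflib.SequenceMatcher` from the standard library stands in for a fuzzy string measure and is not one of the recordlinkage built-ins.

```python
from difflib import SequenceMatcher


def fuzzy(a, b):
    """Fuzzy string similarity in [0, 1] (standard-library stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def exact(a, b):
    """1.0 if both values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical attribute names: one similarity function per attribute.
# The assignment below is an assumption, not the lesson's configuration.
SIMILARITY = {
    "name_clean": fuzzy,      # tolerate small spelling differences
    "name_phonetic": exact,   # phonetic codes already absorb spelling noise
    "phone_clean": exact,     # a phone number either matches or it does not
}


def score_pair(rec_a, rec_b):
    """Compute the vector of similarity scores for one candidate pair."""
    return {attr: fn(rec_a[attr], rec_b[attr]) for attr, fn in SIMILARITY.items()}


# Made-up records; the phonetic codes are placeholders, not real encodings
a = {"name_clean": "jon smith", "name_phonetic": "JNSM", "phone_clean": "5551234"}
b = {"name_clean": "john smith", "name_phonetic": "JNSM", "phone_clean": "5551234"}

scores = score_pair(a, b)
print(scores)
```

Each candidate pair ends up with one score per attribute, which is the vector that a later matching threshold would be applied to.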
The