Similarity Features
Become familiar with the RecordLinkage API for engineering similarity features.
RecordLinkage works in two main steps:
- Indexing: Select which pairs of records are duplicate candidates and therefore should be compared.
- Scoring: Configure and compute a vector of similarity functions for every pair in the index.
All-in indexing
We keep it simple here and add every possible pair to the index—a “full” index in the RecordLinkage terminology.
Every element in the index is a pair of customer_id values. The recordlinkage API warns us against using a full index, which can get very expensive computationally. That is nothing we need to worry about now because the data is small. The size of the full index is a simple function of the sample size: n records yield n(n - 1)/2 unordered pairs.
That’s roughly 373k pairs, which we will process in just a few seconds.
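As a quick check of how the full index grows, the pair count for n records is n(n - 1)/2. The sketch below uses only the standard library. The helper name `full_index_size` and the sample size of 864 are assumptions: the lesson does not state the sample size here, and 864 is chosen only because it lands near 373k pairs.

```python
from itertools import combinations
from math import comb


def full_index_size(n_records: int) -> int:
    """Number of unordered record pairs in a full index: n * (n - 1) / 2."""
    return n_records * (n_records - 1) // 2


# Sanity check on a tiny sample by enumerating the pairs explicitly.
ids = ["a", "b", "c", "d"]
assert full_index_size(len(ids)) == len(list(combinations(ids, 2))) == comb(4, 2)

# Hypothetical sample size, picked only because it yields roughly 373k pairs.
print(full_index_size(864))  # 372816
```

Because the count grows quadratically, ten times as many records means roughly a hundred times as many pairs, which is why a full index is only practical for small samples.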
Measuring similarity
Our data contains seven preprocessed attributes: clean and phonetic versions of the original data's customer names, cities, and streets, plus a clean version of the phone numbers. We configure one similarity function per attribute.
The recordlinkage API has several built-in similarity functions. We have good reasons to choose different methods for different attributes.
The