An Introduction to Entity Resolution in Python/

...

Equality-Based Indexing

Become familiar with different methods of equality-based indexing and associated challenges.

We'll cover the following...

Standard blocking (SB)
Math behind SB
Quality issues in the match keys
Robust variations
- Key takeaway

The typical record consists of several attributes: names, addresses, transaction dates, prices, sizes, colors, etc. We expect duplicates to be similar across most attributes. For some attributes, we even expect an exact match—for example, duplicate customer records will unlikely have different country attributes.

Note: The restaurants dataset we use below is open data. See the Glossary of the course for attribution and references.

Standard blocking (SB)

SB, or “blocking,” is so prevalent that indexing is often used as a synonym for this technique. If not stated otherwise explicitly, people mean SB when they talk about indexing on a particular attribute.

Below, we read the restaurants dataset and use recordlinkage to block by the city attribute:

Press + to interact

C++

import pandas as pd
import recordlinkage as rl
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id')
print('Columns in dataset: ', restaurants.columns.values)
# Full index = all possible pairs:
indexer_1 = rl.Index()
indexer_1.full()
all_pairs = indexer_1.index(restaurants)
# Only pairs where both records match by `city`
indexer_2 = rl.Index()
indexer_2.block('city')
pairs_blocked_by_city = indexer_2.index(restaurants)
print('# all pairs: ', all_pairs.shape[0])
print('# pairs blocked by city: ', pairs_blocked_by_city.shape[0])
print('-> reduction rate: ', 100 - 100 * pairs_blocked_by_city.shape[0] / all_pairs.shape[0], '%')

Blocking by city means we segment records into disjoint subsets, one segment per the city value in the data. Only pairs with both records in the same segment—in other words, equal city—proceed with similarity scoring.

It is an effective way to avoid many likely nonmatching pairs but also comes at the risk of missing some matches.

Math behind SB

Let $n$ , $b$ , and $n_i$ denote the total sample size, number of blocks, and sample size of the $i$ ^thblock so that $n=n_1+...+n_b$ ...

Introduction to Entity Resolution and Applications

A Quickstart Guide Using the RecordLinkage Package

Preprocessing

Indexing

Feature Engineering

Pairwise Matching

Clustering

Integration

Entity Resolution Fundamentals

Matching Products Across Two Online Shops

Conclusion

Appendix

Auto-Tagging System for Content Categorization

Equality-Based Indexing

Standard blocking (SB)

Math behind SB