Equality-Based Indexing
Explore equality-based indexing to improve entity resolution efficiency by segmenting records with exact matching keys. Understand standard blocking to reduce comparisons, identify data quality issues in keys like city names, and apply suffix arrays for robust matching despite variations. This lesson helps you select appropriate indexing strategies to optimize deduplication tasks in Python.
A typical record consists of several attributes: names, addresses, transaction dates, prices, sizes, colors, etc. We expect duplicates to be similar across most attributes. For some attributes, we even expect an exact match. For example, duplicate customer records are unlikely to have different country attributes.
Note: The restaurants dataset we use below is open data. See the Glossary of the course for attribution and references.
Standard blocking (SB)
SB, or “blocking,” is so prevalent that indexing is often used as a synonym for this technique. Unless explicitly stated otherwise, people mean SB when they talk about indexing on a particular attribute.
Below, we read the restaurants dataset and use recordlinkage to block by the city attribute:
Blocking by city means we segment records into disjoint subsets, one segment per distinct city value in the data. Only pairs with both records in the same segment (in other words, equal city) proceed to similarity scoring.
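To make the segmentation concrete, here is a minimal pure-Python sketch of the idea. It is not the recordlinkage call used in the lesson: the records and the helper name standard_blocking are hypothetical and serve only to show which pairs survive when we require equal city values.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (id, name, city); NOT the course's restaurants data.
records = [
    (0, "golden wok", "los angeles"),
    (1, "golden wok restaurant", "los angeles"),
    (2, "harbor view cafe", "san diego"),
    (3, "harbor view", "san diego"),
    (4, "maple street diner", "pasadena"),
]


def standard_blocking(records):
    """Group record ids by exact city value, then pair ids within each block."""
    blocks = defaultdict(list)
    for rec_id, _name, city in records:
        blocks[city].append(rec_id)

    pairs = []
    for ids in blocks.values():
        # Only records sharing the same key are ever paired.
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


candidate_pairs = standard_blocking(records)
all_pairs = list(combinations([rec[0] for rec in records], 2))

print(candidate_pairs)
print(len(candidate_pairs), "of", len(all_pairs), "pairs kept")
```

With these five toy records, blocking by city leaves 2 candidate pairs out of the 10 possible pairs.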
Blocking is an effective way to skip many pairs that are unlikely to match, but it risks missing true matches whose keys differ, for example because the same city is spelled in two ways.
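The risk of missed matches can be illustrated with a small sketch. The records, the true_matches set, and the helper block_pairs are all hypothetical: we assume records 0 and 1 describe the same restaurant but carry different spellings of the city.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: ids 0 and 1 are assumed to be the same restaurant,
# but their city values are not exactly equal.
records = [
    (0, "blue heron grill", "new york"),
    (1, "blue heron grill", "new york city"),
    (2, "maple street diner", "new york"),
]
true_matches = {(0, 1)}  # assumed ground truth for this toy example


def block_pairs(records):
    """Return the set of id pairs whose city values are exactly equal."""
    blocks = defaultdict(list)
    for rec_id, _name, city in records:
        blocks[city].append(rec_id)
    return {pair for ids in blocks.values() for pair in combinations(ids, 2)}


candidate_pairs = block_pairs(records)
missed = true_matches - candidate_pairs

print("candidate pairs:", candidate_pairs)
print("missed matches:", missed)
```

Here the only candidate pair is (0, 2), a likely nonmatch, while the assumed true match (0, 1) never reaches similarity scoring because "new york" and "new york city" fall into different blocks.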
Math behind SB
Let