An Introduction to Entity Resolution in Python/

...

Sort-Based Indexing

Learn about the benefits and drawbacks of sorted neighborhood (SN) indexing and related techniques.

We'll cover the following...

When to use SN indexing
Math behind SN indexing
Windows over sort keys
Key takeaway

What if we could sort all records so that similar ones are nearby? We should compare only records that are close by position in the sorted array—for example, no more than the ten nearest neighbors per record, assuming that there are only a few duplicates per entity. That’s a computationally cheap approach to bringing down superfluous comparisons.

The key challenge here is figuring out how to sort in such a way that proximity translates to similarity that strongly correlates with actual matching likelihood.

When to use SN indexing

Sorted neighborhood (SN) indexing works as follows: first, pick any key attribute for sorting, such as customer_name, as in our example below. Second, rearrange records by the alphanumeric order of their sort key. Third, add each pair to the index where the records are at most $w$ positions apart, where the window size $w$ is up to us to choose.

Press + to interact

C++

import pandas as pd
def sn_index(s: pd.Series, window_size: int) -> pd.MultiIndex:
    """Returns MultiIndex of candidate pairs by Sorted Neighborhood."""
    # Sort the series of keys and add the position
    sorted = s.sort_values().to_frame().assign(_pos=range(restaurants.shape[0])).filter(['_pos'])
    df = pd.concat([
        # Creates frame of all index pairs l positions apart:
        sorted.reset_index().merge(sorted.shift(l).reset_index(), on='_pos', how='inner') for l in range(1, window_size+1)
    ])
    index_name = 'index' if s.index.name is None else s.index.name
    # Always have first > second for consistency reasons:
    df['_first'] = df[[f'{index_name}_x', f'{index_name}_y']].max(axis=1)
    df['_second'] = df[[f'{index_name}_x', f'{index_name}_y']].min(axis=1)
    return pd.MultiIndex.from_frame(df[['_first', '_second']])
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id')
print(sn_index(restaurants['customer_name'], window_size=5))

SN complements SBAdd a pair to the index if the key matches exactly. well. Typically, an attribute that serves well as a match key for SB does not work well as a sort key for SN. Think of blocking by a customer’s country, predestined for use in SB but not in SN—does it make sense to put pairs with records from Albania and Algeria into the index only because those two countries are neighbors by alphabetic sort?

Now, think of names instead as our key attribute. SB will likely not work well because names are prone to inconsistencies, typos, and other variations. On the other hand, SN is expected to catch many of those variations as long the inconsistency is not at the start of the name.

Because SN indexing is so cheap to compute and small in size, we can run SN multiple times using different sort keys—for example, sort by reversed names to catch similar endings, like in “Catherine” and “Katherine.”

Press + to interact

Introduction to Entity Resolution and Applications

A Quickstart Guide Using the RecordLinkage Package

Preprocessing

Indexing

Feature Engineering

Pairwise Matching

Clustering

Integration

Entity Resolution Fundamentals

Matching Products Across Two Online Shops

Conclusion

Appendix

Auto-Tagging System for Content Categorization

Sort-Based Indexing

When to use SN indexing

Math behind SN indexing