Search⌘ K
AI Features

Sort-Based Indexing

Explore how sort-based indexing reduces computational cost in entity resolution by sorting records so similar ones are near. Learn the concept of sorted neighborhood indexing, window size effects, and how multiple passes improve matching accuracy while keeping index size small.

What if we could sort all records so that similar ones are nearby? We should compare only records that are close by position in the sorted array—for example, no more than the ten nearest neighbors per record, assuming that there are only a few duplicates per entity. That’s a computationally cheap approach to bringing down superfluous comparisons.

The key challenge here is figuring out how to sort in such a way that proximity translates to similarity that strongly correlates with actual matching likelihood.

When to use SN indexing

Sorted neighborhood (SN) indexing works as follows: first, pick any key attribute for sorting, such as customer_name, as in our example below. Second, rearrange records by the alphanumeric order of their sort key. Third, add each pair to the index where the records are at most ww positions apart, where the window size ww is up to us to choose.

C++
import pandas as pd
def sn_index(s: pd.Series, window_size: int) -> pd.MultiIndex:
"""Returns MultiIndex of candidate pairs by Sorted Neighborhood."""
# Sort the series of keys and add the position
sorted = s.sort_values().to_frame().assign(_pos=range(restaurants.shape[0])).filter(['_pos'])
df = pd.concat([
# Creates frame of all index pairs l positions apart:
sorted.reset_index().merge(sorted.shift(l).reset_index(), on='_pos', how='inner') for l in range(1, window_size+1)
])
index_name = 'index' if s.index.name is None else s.index.name
# Always have first > second for consistency reasons:
df['_first'] = df[[f'{index_name}_x', f'{index_name}_y']].max(axis=1)
df['_second'] = df[[f'{index_name}_x', f'{index_name}_y']].min(axis=1)
return pd.MultiIndex.from_frame(df[['_first', '_second']])
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id')
print(sn_index(restaurants['customer_name'], window_size=5))

SN complements ...