Evaluate the Match Quality
Review classification errors and learn how to improve a matching model by example.
The restaurants dataset below is open data. See the glossary in this course's appendix for attribution. The dataset's class column resolves the data, telling us which records belong to the same entity and which do not.
import pandas as pd

classes = pd.read_csv('solvers_kitchen/classes.csv')
print(classes.head())
We transform this classes cross-reference table into a pandas MultiIndex of matching pairs, the same format as the predicted_matches object that holds our predicted matches:
from itertools import combinations
from typing import Union

def cross_ref_to_index(
    df: pd.DataFrame,
    id_column: str,
    match_key_columns: Union[str, list[str]],
) -> pd.MultiIndex:
    # Collect the IDs of each entity into one list, sorted in descending order
    match_lists = (
        df.sort_values(id_column, ascending=False)
        .groupby(match_key_columns)[id_column]
        .apply(list)
    )
    # Keep only entities with at least two records; singletons have no matches
    match_lists = match_lists.loc[match_lists.apply(len) > 1]
    # Expand every list into all of its pairwise combinations
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)

true_matches = cross_ref_to_index(df=classes, id_column='customer_id', match_key_columns='class')
print('First three examples:')
print(true_matches[:3])
This way, we can easily compare true_matches with predicted_matches and evaluate the matching quality.
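As a minimal sketch of such a comparison, the snippet below uses pandas set operations on two small toy MultiIndex objects. The names toy_true and toy_predicted and their IDs are made up for illustration; they stand in for true_matches and predicted_matches. The sketch assumes both indexes order each pair the same way (larger ID first, as the descending sort above produces), because (2, 1) and (1, 2) would otherwise count as different pairs.

```python
import pandas as pd

# Hypothetical toy pairs, each ordered as (larger_id, smaller_id)
toy_true = pd.MultiIndex.from_tuples([(2, 1), (5, 4), (9, 7)])
toy_predicted = pd.MultiIndex.from_tuples([(2, 1), (5, 3), (9, 7)])

# Pairs predicted as matches that really are matches
true_positives = toy_predicted.intersection(toy_true)
# Pairs predicted as matches that are not matches
false_positives = toy_predicted.difference(toy_true)
# Real matches that the prediction missed
false_negatives = toy_true.difference(toy_predicted)

print('True positives:', list(true_positives))
print('False positives:', list(false_positives))
print('False negatives:', list(false_negatives))
```

Listing the false positives and false negatives pair by pair is also a convenient starting point for reviewing individual classification errors.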
Evaluation metrics
Predicting match vs. no-match is a binary classification problem. Those familiar with classification know that plain accuracy is misleading here because of the heavy class imbalance: a typical scenario has many more no-matches than matches. The entity resolution literature prefers precision and recall.
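The effect of the imbalance can be checked with a little arithmetic. The sketch below assumes a hypothetical dataset of 1,000 records with 100 true matching pairs (both numbers are invented for illustration) and scores a useless model that predicts no-match for every pair. The helper precision_recall is likewise a hypothetical name, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall); 0.0 is used when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Assumed numbers: 1,000 records yield 1000 * 999 / 2 candidate pairs
n_pairs = 1000 * 999 // 2
n_true = 100  # assumed count of true matching pairs

# A model that predicts "no match" for every single pair
tp, fp, fn = 0, 0, n_true
tn = n_pairs - n_true

accuracy = (tp + tn) / n_pairs
precision, recall = precision_recall(tp, fp, fn)
print(f'accuracy={accuracy:.4f}, precision={precision:.2f}, recall={recall:.2f}')
```

Under these assumptions, accuracy lands very close to 1 even though the model finds no matches at all, while recall is 0 and exposes the failure.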
- ...