Learning from Labeled Examples
Explore how to train binary classification models for entity resolution using labeled examples. Learn to manage severe class imbalances, prepare similarity features, and apply monotonicity constraints with CatBoost to improve match detection. This lesson equips you to handle real-world entity resolution challenges with practical Python techniques.
Entity resolution is a binary classification problem at its heart. For every pair of records, we must decide whether that pair is a “match” (positive class) or a “no-match” (negative class). This lesson is about training a machine learning model on labeled examples, that is, pairs for which we already know the outcome.
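To make the pairwise framing concrete, here is a minimal sketch that turns records into labeled pairs. The toy records, the `entity_id` field, and the `label_pairs` helper are hypothetical and for illustration only. The sketch assumes each record carries a cross-reference identifying its true entity.

```python
from itertools import combinations

# Hypothetical toy records: (record_id, entity_id). The entity_id stands in for
# a cross-reference that tells us which records describe the same entity.
records = [
    ("r1", "e1"),
    ("r2", "e1"),
    ("r3", "e2"),
    ("r4", "e3"),
    ("r5", "e3"),
]

def label_pairs(records):
    """Return ((id_a, id_b), label) for every unordered pair of records.

    label is 1 for a match (same entity) and 0 for a no-match.
    """
    labeled = []
    for (id_a, ent_a), (id_b, ent_b) in combinations(records, 2):
        labeled.append(((id_a, id_b), int(ent_a == ent_b)))
    return labeled

pairs = label_pairs(records)
n_pos = sum(label for _, label in pairs)  # number of matching pairs
n_neg = len(pairs) - n_pos                # number of non-matching pairs
print(f"{len(pairs)} pairs: {n_pos} matches, {n_neg} no-matches")
```

The number of pairs grows quadratically with the number of records, while matches grow roughly linearly. This is why the negative class dominates so heavily on realistic datasets.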
We assume that learners have some experience with machine learning so that this lesson can focus on what is specific to entity resolution: what the typical features look like, how to train and evaluate with class imbalances of 1 to 10,000 or worse, and how to incorporate monotonicity constraints into a classification model.
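The following sketch illustrates why a 1-to-10,000 imbalance changes how we evaluate. It uses synthetic labels and a hypothetical `precision_recall_accuracy` helper, not the lesson's dataset. A trivial model that always predicts “no-match” scores near-perfect accuracy while finding no matches at all, so precision and recall on the positive class are the more informative metrics.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute precision and recall for the positive class, plus accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # By convention here, report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

# Synthetic labels at a 1-to-10,000 ratio: one match, 10,000 no-matches.
y_true = [1] + [0] * 10_000

# A trivial baseline that never predicts a match.
always_no_match = [0] * len(y_true)

precision, recall, accuracy = precision_recall_accuracy(y_true, always_no_match)
print(f"accuracy={accuracy:.4f} precision={precision:.1f} recall={recall:.1f}")

# One common heuristic (an assumption, not necessarily this lesson's choice):
# weight positives by the negative-to-positive ratio during training.
pos_weight = y_true.count(0) / y_true.count(1)
print(f"positive class weight: {pos_weight:.0f}")
```

Many gradient boosting libraries accept a positive-class weight of this kind. Whether to use it, and at what value, is a tuning decision rather than a rule.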
Preparing the North Carolina voters’ features
The dataset below is a lightweight version of the North Carolina voters’ open dataset. It comes with cross-references to know which records truly match by entity. We will use these to build our class ...