Targeting Uncertainty with Active Learning

Become familiar with active learning and how it helps improve binary classification.

The costs in real-world entity resolution can grow out of control for several reasons, one being creating labels for model training. The number of available record pairs to choose from for labeling is massive. How can we make the best out of our typically limited budget?

In this lesson, we start from an existing training dataset and explore how to add a small batch of labels, keeping the added costs low while maximizing the classification model performance improvements.

Training with existing labels

The voters dataset has a moderate size—still small compared to many real-world tasks. We load the data, remove the most likely trivial pairs, and create four similarity features we will use for model training. Note also that this dataset comes with cross-references telling us which pairs match, so we can start from a set of labels and test the approach here.

Get hands-on with 1200+ tech skills courses.