Manual Review and Labeling

Learn how to enhance the review experience with the help of clustering.

We'll cover the following...

Humans in the loop
A simple web app for manual review
Build or buy?
Key takeaway

Clustering is a critical step in every entity resolution pipeline. Most importantly, it resolves conflicts from pairwise matching and enables us to build a cross-reference table.
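To make the idea concrete, here is a minimal sketch of how clustering can turn pairwise matches into a cross-reference table. It assumes the simplest possible clustering rule, connected components (every record reachable through a chain of matches joins the same entity), which is only one of several options and not necessarily the one used in this course. The function name `cluster_matches` and the record IDs are hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group records into entities via connected components (union-find)."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for left, right in matched_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the smaller ID as the root so entity IDs are deterministic.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    # Cross-reference table: record ID -> entity ID (the root of its cluster).
    return {rid: find(rid) for rid in record_ids}


# Hypothetical predictions: a-b and b-c are matches, while a-c was predicted
# as a no-match. Connected components resolves this conflict by merging all three.
records = ["a", "b", "c", "d"]
matches = [("a", "b"), ("b", "c")]
xref = cluster_matches(records, matches)
print(xref)  # {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd'}
```

Connected components always resolves such conflicts in favor of merging, which is one reason a later review of the resulting clusters can be worthwhile.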

We can stop after clustering if we are satisfied with the resolution quality, or we can start another training cycle with the help of some manual review—the topic of this lesson.

Humans in the loop

The following figure shows one of many possible entity resolution workflows, with two (optional) spots for humans in the loop.

[Figure: an entity resolution workflow with two optional human-in-the-loop review cycles]

The training data consists of record pairs labeled as a match or no-match, which we can use to fit a binary classification model and predict classes of unlabeled pairs. We can step into two optional cycles that allow us to improve the training data iteratively and, ultimately, the quality of the resolution.
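As a rough sketch of one such cycle, the snippet below folds manual review decisions back into the labeled pairs before the next model fit. The representation (a dictionary keyed by an order-independent pair of record IDs), the helper names, and the rule that a reviewer's decision overrides an older label are all assumptions made for illustration.

```python
def pair_key(left_id, right_id):
    """Order-independent key so (a, b) and (b, a) refer to the same pair."""
    return frozenset((left_id, right_id))


def merge_reviewed_labels(training_labels, reviewed_labels):
    """Return a copy of the training labels updated with review decisions."""
    merged = dict(training_labels)
    merged.update(reviewed_labels)  # assumption: the reviewer's decision wins
    return merged


# Hypothetical record IDs; True means match, False means no-match.
training_labels = {
    pair_key("r1", "r2"): True,
    pair_key("r3", "r4"): False,
}
reviewed_labels = {
    pair_key("r4", "r3"): True,   # corrects an existing label
    pair_key("r5", "r6"): False,  # adds a newly reviewed pair
}
training_labels = merge_reviewed_labels(training_labels, reviewed_labels)
print(len(training_labels))  # 3
```

The updated labels would then feed the next fit of the binary classifier.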

Classical labeling follows right after the pairwise prediction step: a reviewer checks one predicted pair at a time. This lesson demonstrates the advantage of clustering before review, which lets us review several likely matches at once and spend less time on individual pairs.
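The saving can be illustrated with a small sketch. A cluster of n records contains n(n-1)/2 pairs, so confirming one cluster settles all of those pairs in a single decision. The cross-reference table and the function name `review_batches` below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def review_batches(xref):
    """Group record IDs by entity ID so each cluster is reviewed in one pass."""
    clusters = defaultdict(list)
    for record_id, entity_id in xref.items():
        clusters[entity_id].append(record_id)
    # Singletons contain no pairs, so there is nothing to review.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


# Hypothetical cross-reference table: record ID -> entity ID.
xref = {
    "r1": "e1", "r2": "e1", "r3": "e1", "r4": "e1",
    "r5": "e2", "r6": "e2",
    "r7": "e3",
}
batches = review_batches(xref)
pairs_covered = sum(len(list(combinations(batch, 2))) for batch in batches)
print(len(batches), pairs_covered)  # 2 7
```

Here, two cluster-level reviews cover seven pairs that would otherwise be checked one by one.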

We can either check a cluster of, for example, ...
