Undersampling with NearMiss
Explore how undersampling techniques like NearMiss address class imbalances in entity resolution tasks. Understand the impact of balancing datasets on model performance, improving recall in binary classification while preserving critical information close to decision boundaries. Learn strategies for building effective training datasets from unlabeled data using undersampling.
Real-world entity resolution tasks are severely imbalanced classification problems, which makes them suboptimal for learning. In smaller datasets, we face ratios of one match per thousands of no-matches, and in medium- to large-scale datasets, the ratio is several orders of magnitude worse. Applying indexing techniques can reduce the imbalance to some extent.
Let’s explore how we can improve by applying undersampling to the following precomputed dataset of similarity features, which also includes the ground truth in the class column:
We have 112 matches and 186,114 no-matches in this dataset. That is still moderate compared to many other entity resolution scenarios.
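As a quick sanity check, we can inspect the class distribution directly. Here is a minimal sketch, assuming the dataset is available as a CSV file (the file name `similarity_features.csv` is a placeholder) with the ground truth in the `class` column:

```python
import pandas as pd

# Placeholder file name for the precomputed similarity features
df = pd.read_csv("similarity_features.csv")

# Separate the features from the ground-truth labels
X = df.drop(columns=["class"])
y = df["class"]

# For this dataset: 186114 no-matches (0) vs. 112 matches (1)
print(y.value_counts())
```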
Balancing with minimal information loss
Our dataset contains 1,662 no-matches for every single match. Undersampling means we preserve all 112 examples from the minority class while reducing the number of majority-class examples.
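In its simplest form, this is random undersampling: majority-class examples are discarded at random until the classes are balanced. A minimal sketch using imbalanced-learn's `RandomUnderSampler`, assuming `X` and `y` from the loading sketch above:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop no-matches until both classes have 112 examples
rus = RandomUnderSampler(random_state=42)
X_random, y_random = rus.fit_resample(X, y)

print(Counter(y_random))  # Counter({0: 112, 1: 112})
```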
Size is only one dimension of the problem. We also want to minimize critical information loss while balancing the data. That’s the purpose of an undersampling algorithm, like NearMiss from the imbalanced-learn package.
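Here is a minimal sketch of applying NearMiss (version 1, which keeps the majority examples with the smallest average distance to their nearest minority neighbors), again assuming `X` and `y` from above:

```python
from collections import Counter

from imblearn.under_sampling import NearMiss

# Keep the no-matches closest to the matches, i.e., the examples
# near the presumed decision boundary
nearmiss = NearMiss(version=1, n_neighbors=3)
X_balanced, y_balanced = nearmiss.fit_resample(X, y)

print(Counter(y_balanced))  # Counter({0: 112, 1: 112})
```

Unlike random undersampling, the retained no-matches are those closest to the matches, so the information near the decision boundary is preserved.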
The illustration below contains eight matches. Due to the monotonic nature of our features, we can expect a decision boundary dividing the feature space into an upper-right match zone vs. the ...