Undersampling with NearMiss
Explore how undersampling techniques like NearMiss address class imbalances in entity resolution tasks. Understand the impact of balancing datasets on model performance, improving recall in binary classification while preserving critical information close to decision boundaries. Learn strategies for building effective training datasets from unlabeled data using undersampling.
Real-world entity resolution tasks are severely imbalanced classification problems, which makes them suboptimal for learning. In smaller datasets, we face ratios of one match per thousands of no-matches, and in medium- to large-scale datasets, the ratio is several orders of magnitude worse. Applying indexing techniques can reduce the imbalance to some extent.
Let’s explore how we can improve by applying undersampling to the following precomputed dataset of similarity features, which also includes the ground truth in the class column:
We have 112 matches and 186,114 no-matches in this dataset. That is still moderate compared to many other entity resolution scenarios.
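As a quick sanity check, we can inspect the class distribution directly. Here is a minimal sketch, assuming the dataset is available as a CSV file (the file name `similarity_features.csv` is a placeholder) with the ground truth in the `class` column:

```python
import pandas as pd

# Placeholder file name for the precomputed similarity features
df = pd.read_csv("similarity_features.csv")

# Separate the features from the ground-truth labels
X = df.drop(columns=["class"])
y = df["class"]

# For this dataset: 186114 no-matches (0) vs. 112 matches (1)
print(y.value_counts())
```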
Balancing with minimal information loss
Our dataset contains 1,662 no-matches for every single match. Undersampling means we preserve all 112 examples from the minority class while reducing the number of majority-class examples.
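In its simplest form, this is random undersampling: majority-class examples are discarded at random until the classes are balanced. A minimal sketch using imbalanced-learn's `RandomUnderSampler`, assuming `X` and `y` from the loading sketch above:

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop no-matches until both classes have 112 examples
rus = RandomUnderSampler(random_state=42)
X_random, y_random = rus.fit_resample(X, y)

print(Counter(y_random))  # Counter({0: 112, 1: 112})
```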
Size is only one dimension of the problem. We also want to minimize critical information loss while balancing the data. That’s the purpose of an undersampling algorithm, like NearMiss from the imbalanced-learn package.
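Here is a minimal sketch of applying NearMiss (version 1, which keeps the majority examples with the smallest average distance to their nearest minority neighbors), again assuming `X` and `y` from above:

```python
from collections import Counter

from imblearn.under_sampling import NearMiss

# Keep the no-matches closest to the matches, i.e., the examples
# near the presumed decision boundary
nearmiss = NearMiss(version=1, n_neighbors=3)
X_balanced, y_balanced = nearmiss.fit_resample(X, y)

print(Counter(y_balanced))  # Counter({0: 112, 1: 112})
```

Unlike random undersampling, the retained no-matches are those closest to the matches, so the information near the decision boundary is preserved.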
The illustration below contains eight matches. Due to the monotonic nature of our features, we can expect a decision boundary dividing the feature space into an upper-right match zone vs. the ...