Oversampling with Text Augmentation
Learn how to improve the diversity of the training data by creating artificial matches using text augmentation.
In this lesson, we discuss two common issues with training data:
Usually, our training datasets contain many examples of no-matches and only a few matches. In machine learning jargon, this is a severe class imbalance between the majority (no-matches) and minority classes (matches).
The few examples from the minority class do not adequately cover the class-invariant transformations we have seen in similar tasks (prior knowledge), so our model will not generalize well to unseen examples.
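To make the first issue concrete, here is a minimal sketch of the imbalance and of naive oversampling by resampling the minority class. The match counts come from this lesson's restaurants dataset; the pair contents are hypothetical placeholders, not actual records.

```python
import random

# 112 matches among 372,816 candidate pairs, as in this lesson's
# restaurants dataset: roughly one match per 3,300 pairs.
n_matches, n_pairs = 112, 372816
imbalance_ratio = (n_pairs - n_matches) / n_matches
print(f"majority:minority ratio is about {imbalance_ratio:.0f}:1")

# Naive oversampling: resample minority examples (with replacement)
# until the minority count matches the majority count.
# `minority_pairs` is a placeholder for the real matching pairs.
minority_pairs = [("some restaurant", "some restaurant llc")] * n_matches
oversampled = random.choices(minority_pairs, k=n_pairs - n_matches)
```

Note that duplicating minority examples balances the classes but adds no diversity; that is the gap text augmentation fills in the rest of this lesson.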
Let’s see how text augmentation can help reveal such problems using the following dataset of restaurant records:
The first lines of the output show examples of matching pairs and how they vary in names and streets.
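Variations like these can be produced synthetically. Below is a hedged sketch of a simple text augmenter for record fields; the `augment` helper and the abbreviation table are illustrative assumptions, not part of this lesson's codebase.

```python
import random

# Illustrative table of class-invariant substitutions: abbreviating a
# token does not change which real-world place the record refers to.
ABBREVIATIONS = {"street": "st.", "avenue": "ave.", "restaurant": "rest."}

def augment(text: str, rng: random.Random) -> str:
    """Create an artificial variant of a record field."""
    words = text.lower().split()
    # Randomly abbreviate known tokens.
    words = [ABBREVIATIONS.get(w, w) if rng.random() < 0.5 else w
             for w in words]
    out = " ".join(words)
    # Occasionally drop one character to mimic a typo.
    if len(out) > 3 and rng.random() < 0.5:
        i = rng.randrange(len(out))
        out = out[:i] + out[i + 1:]
    return out

rng = random.Random(42)
print(augment("Main Street Restaurant", rng))
```

Applying `augment` to the fields of known matching pairs yields new artificial matches, which both rebalances the classes and broadens the transformations the model sees.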
Testing performance on seen data
The restaurants dataset comes with ground truth, so we can experiment and evaluate our work. It contains 112 actual matches among 372,816 pairs of records. Let’s assume we have already reviewed every pair, so the entire dataset is available for training. We aim to train a model that generalizes well to unseen records, in other words, to new records not covered by this data.
Below, we load precomputed similarity scores across the three dimensions and fit a binary classification model. We use the CatBoost ...