We use the following methods for performing task-agnostic data augmentation:

  • Masking

  • POS-guided word replacement

  • n-gram sampling

Let's take a look at each one of them.

Understanding the masking method

In the masking method, with probability $p_{\text{mask}}$, we randomly mask a word in the sentence with the [MASK] token and create a new sentence containing the masked token. For instance, suppose we are performing a sentiment analysis task and, say, our dataset contains the sentence 'I was listening to music'. Now, with probability $p_{\text{mask}}$, we randomly mask a word. Say we have masked the word 'music'; then we have a new sentence: 'I was listening to [MASK]'.

But how is this useful? With the [MASK] token in the sentence, our model will not be able to produce confident logits, since [MASK] is an unknown token. The model produces less confident logits for the sentence 'I was listening to [MASK]' (with the masked token) than for the sentence 'I was listening to music' (with the unmasked token). This helps our model understand the contribution of each word to the label.
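A minimal sketch of the masking step might look like the following. The whitespace tokenization, the `p_mask` value, and the per-word masking decision are illustrative assumptions, not a definitive implementation:

```python
import random

def mask_words(sentence, p_mask=0.1):
    """Replace each word with [MASK] with probability p_mask."""
    words = sentence.split()  # simple whitespace tokenization, for illustration only
    masked = ["[MASK]" if random.random() < p_mask else w for w in words]
    return " ".join(masked)

# One possible output: 'I was listening to [MASK]'
print(mask_words("I was listening to music", p_mask=0.3))
```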

Understanding the POS-guided word replacement method

In the POS-guided (part-of-speech-guided) word replacement method, with probability $p_{\text{pos}}$, we replace a word in a sentence with another word that has the same part of speech.

For example, consider the sentence 'Where did you go?' We know that in this sentence, the word 'did' is a verb. Now we can replace the word 'did' with another verb. So our sentence becomes 'Where do you go?' As you can see, we replaced the word 'did' with 'do' and obtained a new sentence.
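Below is a rough sketch of POS-guided replacement using NLTK's POS tagger. The tiny corpus, the per-tag replacement pool built from it, the coarse tag grouping, and the `p_pos` value are assumptions made only to keep the example self-contained (and NLTK resource names can differ slightly across versions):

```python
import random
from collections import defaultdict

import nltk

# Resources needed for tokenization and POS tagging (names may vary by NLTK version).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Tiny illustrative corpus used only to build a pool of candidate words per coarse POS tag.
corpus = ["Where did you go?", "Where do you go?", "I was listening to music."]

pos_pool = defaultdict(set)
for sent in corpus:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
        pos_pool[tag[:2]].add(word.lower())  # group fine-grained tags (VBD, VBP, ...) coarsely

def pos_guided_replace(sentence, p_pos=0.1):
    """Replace each word, with probability p_pos, by another word with the same (coarse) POS tag."""
    new_words = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        candidates = sorted(pos_pool[tag[:2]] - {word.lower()})
        if candidates and random.random() < p_pos:
            new_words.append(random.choice(candidates))
        else:
            new_words.append(word)
    return " ".join(new_words)

# One possible output: 'Where do you go ?'
print(pos_guided_replace("Where did you go?", p_pos=0.5))
```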

Understanding the n-gram sampling method

In the n-gram sampling method, with probability $p_{\text{ng}}$, we randomly sample an n-gram from the sentence, where the value of $n$ is chosen randomly from 1 to 5.
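A possible sketch of n-gram sampling is shown below; the whitespace tokenization and the parameter values are assumptions:

```python
import random

def ngram_sample(sentence, p_ng=0.25, max_n=5):
    """With probability p_ng, keep only a randomly chosen n-gram (n from 1 to max_n)."""
    words = sentence.split()
    if not words or random.random() >= p_ng:
        return sentence  # leave the sentence unchanged
    n = random.randint(1, min(max_n, len(words)))
    start = random.randint(0, len(words) - n)
    return " ".join(words[start:start + n])

# One possible output: 'a beautiful city'
print(ngram_sample("Paris is a beautiful city", p_ng=1.0))
```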

We've learned three different methods for data augmentation. Now let's explore how we exactly apply them.

The data augmentation procedure

Say we have a sentence: 'Paris is a beautiful city'. Let $w_1, w_2, \ldots, w_i, \ldots, w_n$ be the words in the sentence. Now, for each word $w_i$ in our sentence, we create a variable $X_i$ whose value is randomly sampled from the uniform distribution, $X_i \sim \text{Uniform}(0, 1)$. Based on the value of $X_i$, we do the following:

  • If $X_i < p_{\text{mask}}$, then we mask the word $w_i$.

  • If $p_{\text{mask}} \le X_i < p_{\text{mask}} + p_{\text{pos}}$, then we apply POS-guided word replacement to the word $w_i$.

Note that masking and POS-guided word replacement are mutually exclusive; if we apply one, then we can't apply the other.

After the preceding step, we obtain a modified sentence (a synthetic sentence). Now, with probability $p_{\text{ng}}$, we apply n-gram sampling to our synthetic sentence and obtain a final synthetic sentence. Then we append the final synthetic sentence to a data_aug list.
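Putting these pieces together, one pass of the procedure over a single sentence might be sketched as follows. Here, `replace_with_same_pos` stands in for a single-word version of the POS-guided replacement above and is passed in as a callable; the probability values and helper names are illustrative assumptions:

```python
import random

def augment_once(sentence, replace_with_same_pos, p_mask=0.1, p_pos=0.1, p_ng=0.25):
    """Create one synthetic sentence from the input sentence.

    replace_with_same_pos is any callable word -> word (an assumption made
    to keep this sketch short and self-contained).
    """
    new_words = []
    for w in sentence.split():
        x_i = random.uniform(0, 1)            # X_i ~ Uniform(0, 1)
        if x_i < p_mask:                      # masking
            new_words.append("[MASK]")
        elif x_i < p_mask + p_pos:            # POS-guided word replacement
            new_words.append(replace_with_same_pos(w))
        else:
            new_words.append(w)               # leave the word unchanged
    # With probability p_ng, keep only a random n-gram of the synthetic sentence.
    if new_words and random.uniform(0, 1) < p_ng:
        n = random.randint(1, min(5, len(new_words)))
        start = random.randint(0, len(new_words) - n)
        new_words = new_words[start:start + n]
    return " ".join(new_words)

data_aug = []
# Identity "replacement" is used here only so that the sketch runs end to end.
data_aug.append(augment_once("Paris is a beautiful city", replace_with_same_pos=lambda w: w))
print(data_aug)
```

Calling `augment_once` repeatedly on the same sentence yields multiple synthetic variants of it.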

For every sentence, we perform the preceding steps $N$ times and obtain $N$ new synthetic sentences. Okay, but if we have sentence pairs instead of single sentences, how can we obtain synthetic sentence pairs?

Data augmentation for sentence pairs

For sentence pairs, we can create synthetic pairs in a number of ways. Some of these are as follows (see the sketch after this list):

  • We can create a synthetic sentence only from the first sentence and keep the second sentence as it is.

  • We can keep the first sentence as it is and create a synthetic sentence only from the second sentence.

  • We can create synthetic sentences from both the first and second sentences.
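A small sketch of these three strategies might look like the following, where `make_synthetic` stands for any single-sentence augmentation step (for example, the `augment_once` function sketched earlier); the strategy names and random selection are assumptions for illustration:

```python
import random

def augment_pair(sent1, sent2, make_synthetic):
    """Create one synthetic sentence pair using one of the three strategies."""
    strategy = random.choice(["first_only", "second_only", "both"])
    if strategy == "first_only":      # augment the first sentence, keep the second as it is
        return make_synthetic(sent1), sent2
    if strategy == "second_only":     # keep the first sentence as it is, augment the second
        return sent1, make_synthetic(sent2)
    return make_synthetic(sent1), make_synthetic(sent2)  # augment both sentences

# Example usage with a trivial "augmentation" that just masks the last word:
mask_last = lambda s: " ".join(s.split()[:-1] + ["[MASK]"])
print(augment_pair("Paris is a beautiful city", "I love Paris", mask_last))
```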

In this way, we can apply the data augmentation method and obtain more data points. Then, we train our student network with augmented data points.
