Robustly Optimized BERT pre-training Approach (RoBERTa) is another interesting and popular variant of BERT. Researchers observed that BERT was significantly undertrained and proposed several modifications to its pre-training procedure. RoBERTa is essentially BERT with the following changes in pre-training:

  • Use dynamic masking instead of static masking in the MLM task.

  • Remove the NSP task and train using only the MLM task.

  • Train with a large batch size.

  • Use byte-level BPE (BBPE) as a tokenizer.
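
If you want to experiment with a pre-trained RoBERTa model right away, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part; this section does not prescribe a specific library). It loads the roberta-base checkpoint together with its byte-level BPE tokenizer:

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the pre-trained roberta-base checkpoint and its byte-level BPE (BBPE) tokenizer.
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

# The 'Ġ' marker in the output encodes the leading space of each word,
# which is characteristic of byte-level BPE.
print(tokenizer.tokenize('We arrived at the airport in time'))

# Feed the encoded sentence through the model to get contextual representations.
inputs = tokenizer('We arrived at the airport in time', return_tensors='pt')
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```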

Now, let's look into the details and discuss each of the preceding points.

Using dynamic masking instead of static masking

We learned that we pre-train BERT using the MLM and NSP tasks. In the MLM task, we randomly mask 15% of the tokens and let the network predict the masked tokens.

For instance, say we have the sentence 'We arrived at the airport in time'. Now, after tokenizing and adding [CLS] and [SEP] tokens, we have the following:
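tokens = [ [CLS], we, arrived, at, the, airport, in, time, [SEP] ]

To make the difference between the two masking strategies concrete, here is a minimal, simplified sketch (my own illustration, not RoBERTa's exact implementation; it ignores BERT's 80/10/10 replacement rule and works on plain token strings). With static masking, the mask is generated once during preprocessing and the same masked copy is reused in every epoch, whereas with dynamic masking a fresh mask is sampled every time the sentence is fed to the model:

```python
import random

def mask_tokens(tokens, mask_prob=0.15):
    # Randomly replace ~15% of the non-special tokens with [MASK].
    masked = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if t not in ('[CLS]', '[SEP]')]
    num_to_mask = max(1, round(len(candidates) * mask_prob))
    for i in random.sample(candidates, num_to_mask):
        masked[i] = '[MASK]'
    return masked

tokens = ['[CLS]', 'we', 'arrived', 'at', 'the', 'airport', 'in', 'time', '[SEP]']

# Static masking: mask once during preprocessing and reuse the same copy in every epoch.
static_copy = mask_tokens(tokens)
for epoch in range(3):
    print('static :', static_copy)

# Dynamic masking: sample a fresh mask each time the model sees the sentence.
for epoch in range(3):
    print('dynamic:', mask_tokens(tokens))
```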
