Pre-Training Procedure

BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets. We have also learned that BERT is pre-trained with two tasks: masked language modeling (the cloze task) and next sentence prediction (NSP). Now, how do we prepare the dataset to train BERT on these two tasks?

Preparing the dataset

First, we sample two sentences (two text spans) from the corpus; let's call them sentence A and sentence B. The total number of tokens from sentences A and B combined should be less than or equal to 512. While sampling the two sentences, for 50% of the time we sample sentence B as the actual follow-up sentence to sentence A, and for the other 50% of the time we sample sentence B as a sentence that is not the follow-up to sentence A.
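The following is a minimal sketch of this sampling step, assuming the corpus is a list of documents where each document is a list of sentences. The function name `create_nsp_pair`, the `tokenize` argument, and the toy corpus are illustrative choices, not part of BERT's reference implementation, and whitespace splitting stands in for the WordPiece tokenizer used in practice.

```python
import random


def create_nsp_pair(documents, tokenize, max_tokens=512):
    """Sample one (sentence A, sentence B, is_next) example for the NSP task."""
    # Pick a document with at least two sentences so a true follow-up exists.
    doc = random.choice([d for d in documents if len(d) >= 2])
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]

    if random.random() < 0.5:
        # 50% of the time: B is the actual next sentence (label: isNext).
        sentence_b = doc[idx + 1]
        is_next = True
    else:
        # Other 50% of the time: B is a random sentence from a random
        # document (label: notNext).
        sentence_b = random.choice(random.choice(documents))
        is_next = False

    # The pair must fit in BERT's 512-token limit; three positions are
    # reserved for the [CLS] token and the two [SEP] tokens.
    if len(tokenize(sentence_a)) + len(tokenize(sentence_b)) + 3 > max_tokens:
        return None
    return sentence_a, sentence_b, is_next


# Toy usage with whitespace tokenization.
corpus = [
    ["She cooked pasta.", "It was delicious."],
    ["The sky is clear today.", "Birds are singing outside."],
]
print(create_nsp_pair(corpus, tokenize=str.split))
```

Pairs that exceed the token limit are simply skipped here; a full data pipeline would keep sampling until it collects the desired number of examples.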

Suppose we sampled the following two sentences:
