Pre-Training Procedure

BERT is pre-trained on the Toronto BookCorpus and English Wikipedia datasets. We have also learned that BERT is pre-trained with two tasks: masked language modeling (the cloze task) and next sentence prediction (NSP). Now, how do we prepare the dataset to train BERT on these two tasks?

Preparing the dataset

First, we sample two sentences (two text spans) from the corpus; let's call them sentence A and sentence B. The total number of tokens from sentences A and B combined should be less than or equal to 512. While sampling the two sentences, for 50% of the time we sample sentence B as the actual follow-up sentence to sentence A, and for the other 50% of the time we sample sentence B as a sentence that is not the follow-up to sentence A.
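The following is a minimal sketch of this sampling step, assuming the corpus is a list of documents where each document is a list of sentences. The function name `create_nsp_pair`, the `tokenize` argument, and the toy corpus are illustrative choices, not part of BERT's reference implementation, and whitespace splitting stands in for the WordPiece tokenizer used in practice.

```python
import random


def create_nsp_pair(documents, tokenize, max_tokens=512):
    """Sample one (sentence A, sentence B, is_next) example for the NSP task."""
    # Pick a document with at least two sentences so a true follow-up exists.
    doc = random.choice([d for d in documents if len(d) >= 2])
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]

    if random.random() < 0.5:
        # 50% of the time: B is the actual next sentence (label: isNext).
        sentence_b = doc[idx + 1]
        is_next = True
    else:
        # Other 50% of the time: B is a random sentence from a random
        # document (label: notNext).
        sentence_b = random.choice(random.choice(documents))
        is_next = False

    # The pair must fit in BERT's 512-token limit; three positions are
    # reserved for the [CLS] token and the two [SEP] tokens.
    if len(tokenize(sentence_a)) + len(tokenize(sentence_b)) + 3 > max_tokens:
        return None
    return sentence_a, sentence_b, is_next


# Toy usage with whitespace tokenization.
corpus = [
    ["She cooked pasta.", "It was delicious."],
    ["The sky is clear today.", "Birds are singing outside."],
]
print(create_nsp_pair(corpus, tokenize=str.split))
```

Pairs that exceed the token limit are simply skipped here; a full data pipeline would keep sampling until it collects the desired number of examples.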

Suppose we sampled the following two sentences:
