
Pre-Training Procedure

Explore the pre-training procedure of Google BERT, including how the dataset is prepared with sentence sampling, tokenization using WordPiece, and token masking. Understand how BERT is trained simultaneously on the masked language modeling and next sentence prediction tasks. Learn about optimization techniques used for effective BERT training, such as the Adam optimizer, learning rate scheduling with warm-up steps, dropout, and the GELU activation function.

BERT is pre-trained on the Toronto BookCorpus and Wikipedia datasets. We have also learned that BERT is pre-trained with two tasks: masked language modeling (the cloze task) and next sentence prediction (NSP). Now, how do we prepare the dataset to train BERT on these two tasks?

Preparing the dataset

First, we sample two sentences (two text spans) from the corpus. Let's say we sampled two sentences, A and B. The total number of tokens from the two sentences A and B combined should be less than or equal to 512. While sampling the two sentences, for 50% of the time sentence B is the actual sentence that follows sentence A (labeled isNext), and for the other 50% of the time sentence B is a random sentence from the corpus (labeled notNext).
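
To make this concrete, here is a minimal sketch of the sampling step in Python. The corpus layout (a list of documents, each a list of WordPiece-tokenized sentences), the function name sample_sentence_pair, and the truncation strategy are illustrative assumptions; only the 50/50 isNext/notNext split and the 512-token budget come from the procedure described above.

```python
import random

MAX_TOKENS = 512  # combined token budget for sentences A and B

def sample_sentence_pair(corpus):
    """Sample a sentence pair (A, B) with an NSP label.

    `corpus` is assumed to be a list of documents, where each document is a
    list of already WordPiece-tokenized sentences (each a list of tokens),
    and every document has at least two sentences.
    """
    doc = random.choice(corpus)
    # Pick sentence A anywhere except the last position so a "next" sentence exists.
    idx = random.randrange(len(doc) - 1)
    sentence_a = list(doc[idx])

    if random.random() < 0.5:
        # 50% of the time, B is the actual next sentence -> label isNext.
        sentence_b = list(doc[idx + 1])
        is_next = True
    else:
        # Otherwise, B is a random sentence from the corpus -> label notNext.
        other_doc = random.choice(corpus)
        sentence_b = list(random.choice(other_doc))
        is_next = False

    # Enforce the combined 512-token limit by trimming the longer span.
    while len(sentence_a) + len(sentence_b) > MAX_TOKENS:
        longer = sentence_a if len(sentence_a) > len(sentence_b) else sentence_b
        longer.pop()

    return sentence_a, sentence_b, is_next
```

In practice, this sampling is repeated over the whole corpus to build the pre-training examples; each sampled pair is then arranged into the [CLS] A [SEP] B [SEP] input format before token masking is applied.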