
Pre-Training Procedure

Explore the pre-training procedure of Google BERT, including how the dataset is prepared with sentence sampling, tokenization using WordPiece, and token masking. Understand how BERT is trained simultaneously on the masked language modeling and next sentence prediction tasks. Learn about optimization techniques used for effective BERT training, such as the Adam optimizer, learning rate scheduling with warm-up steps, dropout, and the GELU activation function.

BERT is pre-trained on the Toronto BookCorpus and Wikipedia datasets. We have also learned that BERT is pre-trained with two tasks: masked language modeling (the cloze task) and next sentence prediction (NSP). Now, how do we prepare the dataset to train BERT on these two tasks?

Preparing the dataset

First, we sample two sentences (two text spans) from the corpus. Let's say we sampled two sentences, A and B. The total number of tokens from the two sentences A and B combined should be less than or equal to 512. While sampling the two sentences, for 50% of the time sentence B is the actual sentence that follows sentence A (labeled isNext), and for the other 50% of the time sentence B is a random sentence from the corpus (labeled notNext).
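
To make this concrete, here is a minimal sketch of the sampling step in Python. The corpus layout (a list of documents, each a list of WordPiece-tokenized sentences), the function name sample_sentence_pair, and the truncation strategy are illustrative assumptions; only the 50/50 isNext/notNext split and the 512-token budget come from the procedure described above.

```python
import random

MAX_TOKENS = 512  # combined token budget for sentences A and B

def sample_sentence_pair(corpus):
    """Sample a sentence pair (A, B) with an NSP label.

    `corpus` is assumed to be a list of documents, where each document is a
    list of already WordPiece-tokenized sentences (each a list of tokens),
    and every document has at least two sentences.
    """
    doc = random.choice(corpus)
    # Pick sentence A anywhere except the last position so a "next" sentence exists.
    idx = random.randrange(len(doc) - 1)
    sentence_a = list(doc[idx])

    if random.random() < 0.5:
        # 50% of the time, B is the actual next sentence -> label isNext.
        sentence_b = list(doc[idx + 1])
        is_next = True
    else:
        # Otherwise, B is a random sentence from the corpus -> label notNext.
        other_doc = random.choice(corpus)
        sentence_b = list(random.choice(other_doc))
        is_next = False

    # Enforce the combined 512-token limit by trimming the longer span.
    while len(sentence_a) + len(sentence_b) > MAX_TOKENS:
        longer = sentence_a if len(sentence_a) > len(sentence_b) else sentence_b
        longer.pop()

    return sentence_a, sentence_b, is_next
```

In practice, this sampling is repeated over the whole corpus to build the pre-training examples; each sampled pair is then arranged into the [CLS] A [SEP] B [SEP] input format before token masking is applied.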