BERT is an auto-encoding language model, meaning that it reads the sentence in both directions to make a prediction. In the masked language modeling task, we randomly mask 15% of the words in a given input sentence and train the network to predict them; to predict a masked word, the model reads the sentence in both directions, left to right and right to left.
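To make the masking step concrete, here is a minimal Python sketch of how roughly 15% of the tokens in a sentence can be masked; the mask_tokens helper is purely illustrative, and the actual BERT pretraining procedure additionally replaces some of the selected tokens with random tokens or leaves them unchanged rather than always using [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace roughly mask_prob of the tokens with the mask token."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # original token kept only at masked positions
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = token              # the model is trained to predict this token
            masked[i] = mask_token
    return masked, labels

tokens = ["Paris", "is", "a", "beautiful", "city"]
masked_tokens, labels = mask_tokens(tokens)
print(masked_tokens)   # e.g. ['Paris', 'is', 'a', '[MASK]', 'city']
print(labels)          # e.g. [None, None, None, 'beautiful', None]
```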

Training the BERT model for the MLM task

Let's understand how masked language modeling works with an example, using the same sentences we saw earlier: 'Paris is a beautiful city' and 'I love Paris'.

Tokenize the sentence

First, we tokenize the sentences and get the tokens, as shown here:

tokens = [Paris, is, a, beautiful, city, I, love, Paris]
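The following short Python sketch reproduces this word-level token list with a simple whitespace split; note that this is only for illustration, since the actual BERT model uses a WordPiece subword tokenizer.

```python
sentence_a = "Paris is a beautiful city"
sentence_b = "I love Paris"

# Whitespace tokenization, matching the word-level tokens in this example;
# real BERT tokenization would use WordPiece subwords instead.
tokens = sentence_a.split() + sentence_b.split()
print(tokens)
# ['Paris', 'is', 'a', 'beautiful', 'city', 'I', 'love', 'Paris']
```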
