Summary: Different BERT Variants

Let’s summarize what we have learned so far.

Key highlights

Summarized below are the main highlights of what we've learned in this chapter.

  • We learned how ALBERT works. ALBERT is a lite version of BERT that uses two interesting parameter reduction techniques: cross-layer parameter sharing and factorized embedding parameterization. We also learned about the sentence order prediction (SOP) task used in ALBERT, a binary classification task in which the model predicts whether the order of a given sentence pair is correct or swapped. A parameter-count comparison is sketched after this list.

  • We looked into the RoBERTa model. RoBERTa is a variant of BERT that is trained using only the MLM task. Unlike BERT, it applies dynamic masking instead of static masking and is trained with larger batch sizes. It uses a byte-level BPE (BBPE) tokenizer with a vocabulary of about 50,000 tokens; the second sketch after this list compares its tokenizer with BERT's.
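
The short sketch below illustrates how ALBERT's two parameter reduction techniques shrink the model. It is a minimal example using the Hugging Face transformers library: it builds an ALBERT-base-style configuration (128-dimensional embeddings factorized away from the 768-dimensional hidden size, with one shared group of encoder parameters) alongside a comparable BERT-base configuration and compares their parameter counts. The exact numbers printed may vary slightly with the library version and configuration defaults.

```python
from transformers import AlbertConfig, AlbertModel, BertConfig, BertModel

# ALBERT-base-style configuration: the embedding size (128) is factorized
# away from the hidden size (768), and num_hidden_groups=1 (the default)
# shares one set of encoder parameters across all 12 layers.
albert_config = AlbertConfig(
    embedding_size=128,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
albert = AlbertModel(albert_config)

# Comparable BERT-base configuration: the embedding size is tied to the
# hidden size, and every layer keeps its own parameters.
bert = BertModel(BertConfig())  # BertConfig defaults correspond to BERT-base

def num_params(model):
    return sum(p.numel() for p in model.parameters())

print(f"ALBERT parameters: {num_params(albert):,}")   # roughly 12M
print(f"BERT parameters:   {num_params(bert):,}")     # roughly 110M
```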

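As a quick illustration of the tokenizer difference, the sketch below (again using the Hugging Face transformers library, and assuming the pretrained roberta-base and bert-base-uncased checkpoints can be downloaded) loads RoBERTa's byte-level BPE tokenizer next to BERT's WordPiece tokenizer, prints their vocabulary sizes, and tokenizes the same sentence with both.

```python
from transformers import RobertaTokenizer, BertTokenizer

# RoBERTa uses a byte-level BPE (BBPE) tokenizer with a ~50K vocabulary.
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
print(roberta_tokenizer.vocab_size)                     # 50265
print(roberta_tokenizer.tokenize("It was a great day"))

# BERT uses a WordPiece tokenizer with a ~30K vocabulary, for comparison.
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(bert_tokenizer.vocab_size)                        # 30522
print(bert_tokenizer.tokenize("It was a great day"))
```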