Summary: Different BERT Variants
Let’s summarize what we have learned so far.
Key highlights
Summarized below are the main highlights of what we've learned in this chapter.
We learned how ALBERT works. ALBERT is a lite version of BERT that uses two interesting parameter reduction techniques: cross-layer parameter sharing and factorized embedding parameterization. We also learned about the sentence order prediction (SOP) task used in ALBERT. SOP is a binary classification task in which the model has to classify whether the sentences in a given sentence pair appear in the correct order or have been swapped. The parameter savings from factorized embedding parameterization are illustrated in the sketch below.
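As a rough illustration of why factorized embedding parameterization reduces parameters, the following sketch compares the embedding parameter counts with and without factorization. The sizes used (a vocabulary of 30,000, a hidden size of 768, and a low-dimensional embedding size of 128) are assumed example values, not figures taken from this chapter.

```python
# A minimal sketch of the parameter savings from factorized embedding
# parameterization; the sizes below are assumed example values.

vocab_size = 30000   # V: vocabulary size
hidden_size = 768    # H: hidden size of the encoder layers
embed_size = 128     # E: low-dimensional embedding size used by ALBERT

# Without factorization (as in BERT), tokens are embedded directly
# into the hidden size: a single V x H matrix.
bert_style = vocab_size * hidden_size

# With factorization (as in ALBERT), tokens are first embedded into a
# small E-dimensional space and then projected up to H:
# a V x E matrix followed by an E x H matrix.
albert_style = vocab_size * embed_size + embed_size * hidden_size

print(f"V x H           : {bert_style:,} parameters")
print(f"V x E + E x H   : {albert_style:,} parameters")
print(f"reduction factor: {bert_style / albert_style:.1f}x")
```

With these example sizes, the factorized embedding uses roughly one-sixth of the parameters of the direct embedding, which is the intuition behind the technique.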
We also looked into the RoBERTa model. RoBERTa is a variant of BERT that is trained with only the MLM task, dropping the NSP task used in BERT. Unlike BERT, it uses dynamic masking instead of static masking, and it is trained with a larger batch size. It uses byte-level byte pair encoding (BBPE) as its tokenizer, with a vocabulary size of 50,000. The difference between static and dynamic masking is sketched below.
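The following is a minimal sketch of the difference between static and dynamic masking: static masking fixes the masked positions once during preprocessing, while dynamic masking re-samples them every time the sequence is fed to the model. The 15% masking rate matches the chapter; the function name, the toy sentence, and the simple whitespace tokenization are assumptions made purely for illustration.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15  # placeholder mask token and masking rate

def mask_tokens(tokens):
    """Randomly replace roughly 15% of the tokens with the mask token."""
    return [MASK if random.random() < MASK_PROB else tok for tok in tokens]

tokens = "Paris is a beautiful city and I love Paris".split()

# Static masking (BERT): the masked positions are chosen once during
# preprocessing, so every epoch sees the same masked sequence.
static = mask_tokens(tokens)
for epoch in range(3):
    print("static :", static)

# Dynamic masking (RoBERTa): the masked positions are re-sampled each
# time the sequence is fed to the model, so every epoch sees a new pattern.
for epoch in range(3):
    print("dynamic:", mask_tokens(tokens))
```

This sketch only shows where masks land; it omits details of the full masking procedure (such as sometimes keeping the original token or substituting a random one), which were covered when MLM was introduced.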