BERT Models for Italian and Portuguese

Learn about the architecture and the different variants of the UmBERTo and BERTimbau models.

UmBERTo for Italian

UmBERTo is a pre-trained BERT model for the Italian language from Musixmatch Research. The UmBERTo model inherits the RoBERTa architecture. RoBERTa is essentially BERT with the following changes in pre-training:

  • Dynamic masking is used instead of static masking in the masked language modeling (MLM) task, as shown in the sketch after this list.

  • The next sentence prediction (NSP) task is removed, and the model is trained using only the MLM task.

  • Training is carried out with a larger batch size.

  • Byte-level byte pair encoding (BPE) is used as the tokenizer.
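
The following sketch illustrates dynamic masking, assuming the Hugging Face transformers library: DataCollatorForLanguageModeling generates a fresh mask pattern every time a batch is assembled, so the same sentence is masked differently on every epoch (the roberta-base model ID is used only for illustration).

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# mlm=True trains with masked language modeling only (no NSP objective);
# 15% of the tokens are selected for masking, as in BERT and RoBERTa.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoding = tokenizer("RoBERTa masks tokens dynamically during pre-training.")

# Each call re-masks the same example, so the mask positions differ.
batch_1 = collator([{"input_ids": encoding["input_ids"]}])
batch_2 = collator([{"input_ids": encoding["input_ids"]}])
print(batch_1["input_ids"])
print(batch_2["input_ids"])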

UmBERTo extends the RoBERTa architecture by using the SentencePiece tokenizer and whole word masking (WWM).
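
As a quick illustration of the SentencePiece tokenizer, the sketch below loads the UmBERTo tokenizer through the transformers library (assuming the Musixmatch/umberto-commoncrawl-cased-v1 checkpoint on the Hugging Face model hub) and tokenizes an Italian sentence. The leading "▁" symbol marks the start of a word, which is what whole word masking relies on to mask all the pieces of a word together.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# SentencePiece splits the text into subword pieces; "▁" marks word starts.
print(tokenizer.tokenize("Umberto Eco è stato un grande scrittore"))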

Variants of the UmBERTo model

Researchers have released two pre-trained UmBERTo models:

  • umberto-wikipedia-uncased-v1: Trained on the Italian Wikipedia corpus.

  • umberto-commoncrawl-cased-v1: Trained on the Italian portion of the CommonCrawl dataset.

The pre-trained UmBERTo models can be downloaded from GitHub. We can also use the pre-trained UmBERTo model with the transformers library, as shown here:
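
Here is a minimal sketch, assuming the Musixmatch/umberto-commoncrawl-cased-v1 checkpoint is available on the Hugging Face model hub (the Wikipedia variant can be loaded the same way by swapping the model ID):

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Musixmatch/umberto-commoncrawl-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an Italian sentence and obtain a contextual embedding per token.
inputs = tokenizer("Roma è la capitale d'Italia", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # [batch_size, sequence_length, hidden_size]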
