The XLM-R Model

Learn about the XLM-R model, configurations for pre-training the model, and its evaluation.

The XLM-RoBERTa (XLM-R) model is essentially an extension of XLM with a few modifications that improve performance. It is a state-of-the-art model for learning cross-lingual representations.

Pre-training the XLM-R model

XLM is trained with the MLM and TLM tasks. The MLM task uses a monolingual dataset, while the TLM task requires a parallel dataset, which is difficult to obtain for low-resource languages. For this reason, XLM-R is trained with the MLM objective alone and drops the TLM objective. Thus, the XLM-R model requires only a monolingual dataset.
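To get a feel for what MLM-only pre-training gives us, we can query a pre-trained XLM-R checkpoint for masked-token predictions. The snippet below is a minimal sketch assuming the Hugging Face transformers library and its publicly released xlm-roberta-base checkpoint, which are not part of this lesson's text:

```python
from transformers import pipeline

# Load a pre-trained XLM-R checkpoint for masked language modeling.
# XLM-R uses <mask> as its mask token.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# Because XLM-R is pre-trained with MLM on text from 100 languages,
# the same model can fill in masked tokens across languages.
print(fill_mask("Paris is the capital of <mask>.")[:2])
print(fill_mask("París es la capital de <mask>.")[:2])
```

No parallel sentences are needed at any point: each input is a monolingual sentence with a masked token, which is exactly the MLM setup described above.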

XLM-R is trained on a huge dataset of 2.5 TB. The dataset is obtained by filtering unlabeled text in 100 languages from the CommonCrawl corpus. The proportion of low-resource languages in the dataset is also increased through sampling. The following diagram compares the corpus sizes of the CommonCrawl and Wikipedia datasets. We can observe that CommonCrawl is much larger than Wikipedia, especially for low-resource languages:
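To make the sampling step above concrete, here is a minimal sketch of exponential smoothing of language frequencies, where each language's raw frequency is raised to a power alpha < 1 and renormalized so that low-resource languages are sampled more often. The corpus sizes and the alpha value below are illustrative assumptions, not figures from this lesson:

```python
# Sketch of language-sampling rebalancing via exponential smoothing:
#   q_i = p_i**alpha / sum_j (p_j**alpha), with alpha < 1.
# Corpus sizes and alpha here are illustrative only.

corpus_sizes = {"en": 300_000, "fr": 60_000, "sw": 1_000}  # sentences per language

def sampling_probs(sizes, alpha=0.3):
    total = sum(sizes.values())
    p = {lang: n / total for lang, n in sizes.items()}           # raw frequencies
    weights = {lang: prob ** alpha for lang, prob in p.items()}  # smooth with alpha < 1
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}          # renormalize

print(sampling_probs(corpus_sizes))
# A low-resource language such as "sw" gets a larger sampling share
# than its raw frequency, so the model sees it more often during training.
```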
