Learning Multilingual Embeddings Through Knowledge Distillation

Learn how to apply Sentence-BERT to languages other than English using the teacher-student architecture.

Let's understand how to make monolingual sentence embeddings multilingual through knowledge distillation. We learned how M-BERT, XLM, and XLM-R work and how they produce representations for different languages. In all these models, however, the vector spaces of different languages are not aligned; that is, the same sentence in different languages is mapped to different locations in the vector space. Now, we will see how to map similar sentences in different languages to the same location in the vector space.

We learned how Sentence-BERT works and how it generates the representation of a sentence. But how do we use Sentence-BERT for languages other than English?

Sentence-BERT for other languages

We can apply Sentence-BERT to different languages by making the monolingual sentence embeddings it generates multilingual through knowledge distillation. To do this, we transfer the knowledge of Sentence-BERT to a multilingual model, say, XLM-R, so that the multilingual model generates embeddings just like the pre-trained Sentence-BERT. Let's explore this in more detail.

The XLM-R model generates embeddings for 100 different languages. We take the pre-trained XLM-R model and teach it to generate sentence embeddings for different languages just like Sentence-BERT does. We use the pre-trained Sentence-BERT as the teacher model and the pre-trained XLM-R as the student model.
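To make this setup concrete, here is a minimal sketch of the teacher and student using the sentence-transformers library. The model names (bert-base-nli-mean-tokens for the teacher and xlm-roberta-base for the student) are illustrative assumptions, not prescribed by this lesson; any pre-trained Sentence-BERT model and any multilingual encoder with a matching embedding dimension could be used.

```python
from sentence_transformers import SentenceTransformer, models

# Teacher: a pre-trained English Sentence-BERT model (name is illustrative)
teacher = SentenceTransformer('bert-base-nli-mean-tokens')

# Student: pre-trained XLM-R with a mean-pooling layer on top, so that it
# outputs one fixed-size sentence vector, just like the teacher
xlmr = models.Transformer('xlm-roberta-base')
pooling = models.Pooling(xlmr.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
student = SentenceTransformer(modules=[xlmr, pooling])

# Both models produce 768-dimensional vectors here, so the student's output
# can be compared directly with the teacher's output during distillation
```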

Say we have a source sentence in English and the corresponding target sentence in French: [How are you, Comment ça va]. First, we feed the source sentence to the teacher (Sentence-BERT) and get its sentence representation. Next, we feed both the source and target sentences to the student (XLM-R) and get their sentence representations, as shown in the following figure. We then train the student to minimize the mean squared error between the teacher's representation of the source sentence and each of the student's two representations, so that the student maps both How are you and Comment ça va close to the teacher's embedding of How are you.
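The following self-contained sketch shows one distillation step for this sentence pair, assuming PyTorch, Hugging Face transformers, and sentence-transformers are installed; the model names and the mean-pooling helper are illustrative assumptions, not part of the original lesson.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

teacher = SentenceTransformer('bert-base-nli-mean-tokens')   # frozen teacher
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
student = AutoModel.from_pretrained('xlm-roberta-base')      # trainable student
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padding, to get one sentence vector
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

source, target = "How are you", "Comment ça va"

# 1. Teacher embedding of the English source sentence (no gradients needed);
#    kept on the CPU so it matches the student's device in this sketch
with torch.no_grad():
    teacher_emb = teacher.encode(source, convert_to_tensor=True).cpu()

# 2. Student embeddings of both the source and the target sentence
batch = tokenizer([source, target], padding=True, return_tensors='pt')
student_embs = mean_pool(student(**batch).last_hidden_state,
                         batch['attention_mask'])

# 3. Pull both student embeddings toward the teacher's source embedding
loss = F.mse_loss(student_embs[0], teacher_emb) + \
       F.mse_loss(student_embs[1], teacher_emb)
loss.backward()        # gradients flow into the student (XLM-R) only
optimizer.step()
optimizer.zero_grad()
```

Because both student embeddings are pulled toward the same teacher vector, the English sentence and its French translation end up close to each other in the student's vector space, which is exactly the cross-lingual alignment we want.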
