Machine Translation

Get an overview of machine translation and learn to perform it using Hugging Face.

Overview

Another excellent application of NLP is machine translation, where text written in one natural language is automatically translated into another natural language. We've all seen significant and continuous improvement in the results produced by common translators like Google and Bing. This improvement can be attributed to transformers, bigger datasets, and better models.

Machine translation was one of the earliest intended applications of AI. Despite its initial success in games and some trivial tasks, AI was unable to perform machine translation, and this failure was one significant reason behind the first AI winter. (Similar to nuclear winter, AI winter refers to an era of reduced funding and interest in AI research.)

Translation with Hugging Face

Translation is another sequence-to-sequence task. We can perform it with Hugging Face as follows:

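from transformers import pipeline
# Build an English-to-French translation pipeline with the default model
en_fr_translator = pipeline("translation_en_to_fr")
# The result is a list of dicts, each with a 'translation_text' key
print(en_fr_translator("It's a pleasant day."))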
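The pipeline returns a list with one dictionary per input, and the translated string sits under the translation_text key. A minimal sketch of pulling the text out is shown below; the helper names extract_translation and translate_en_fr are illustrative and not part of the library.

```python
def extract_translation(result):
    """Return the translated string from a translation pipeline result.

    The pipeline returns a list such as [{"translation_text": "..."}].
    """
    return result[0]["translation_text"]


def translate_en_fr(text):
    """Translate English text to French with the default pipeline model.

    The import lives inside the function so that extract_translation can be
    used without loading transformers. The first call downloads the model.
    """
    from transformers import pipeline

    en_fr_translator = pipeline("translation_en_to_fr")
    return extract_translation(en_fr_translator(text))
```

Calling translate_en_fr("It's a pleasant day.") would then return a plain string instead of the nested structure.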

Choose a specific model

The example above uses the default translation model, T5-base. While T5 is a frequently used model, its translation training covers only three target languages (English to French, German, and Romanian), so we need other models for other language pairs.
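With the default model, the target language is selected through the task name, which follows the pattern translation_xx_to_yy. The sketch below assumes the three English-to-X pairs covered by T5; the names T5_TARGETS and t5_task_name are hypothetical and exist only for illustration.

```python
# Target languages covered by T5's translation training (the source is English).
T5_TARGETS = {"fr": "French", "de": "German", "ro": "Romanian"}


def t5_task_name(target):
    """Build the pipeline task name for an English-to-target translation."""
    if target not in T5_TARGETS:
        raise ValueError(f"No default English-to-{target!r} translation task")
    return f"translation_en_to_{target}"
```

Passing t5_task_name("de") to pipeline would request English-to-German translation from the default model, while an unsupported code fails early with a clear error.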

Hugging Face also lets us choose among many translation models. As of August 2022, there are more than 1,600 models for translation alone, and the number keeps growing.

Note: Before using a specific model, we first need to install SentencePiece, an unsupervised text tokenizer and detokenizer mainly for neural-network-based text generation systems where the vocabulary size is fixed before the neural model is trained.
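One way to confirm that the dependency is available before loading a model is sketched below. It assumes the package is importable under the name sentencepiece; has_sentencepiece is a hypothetical helper.

```python
import importlib.util


def has_sentencepiece():
    """Return True if the sentencepiece package can be imported."""
    return importlib.util.find_spec("sentencepiece") is not None


if not has_sentencepiece():
    # Typical install command; adjust for your environment.
    print("SentencePiece is missing. Install it with: pip install sentencepiece")
```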

To use a particular model, we specify it as follows:

from transformers import pipeline
mandarin_model = "Helsinki-NLP/opus-mt-zh-en"  # Most popular translation model on HF
translator = pipeline("translation", model=mandarin_model)
# This model translates Chinese to English, so the input must be Chinese text
print(translator("今天天气很好。"))
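The Helsinki-NLP checkpoints generally follow the naming pattern Helsinki-NLP/opus-mt-{source}-{target}. The sketch below builds a model ID from two language codes, assuming a checkpoint exists for that pair (not every pair does); opus_mt_model_id and build_translator are hypothetical helper names.

```python
def opus_mt_model_id(source, target):
    """Build a Helsinki-NLP OPUS-MT model ID from two language codes."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


def build_translator(source, target):
    """Create a translation pipeline for the given language pair.

    Loading fails if no checkpoint exists under the generated ID.
    """
    from transformers import pipeline

    return pipeline("translation", model=opus_mt_model_id(source, target))
```

For example, build_translator("zh", "en") would load the same Chinese-to-English model used above.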

Note: Hugging Face allows us to override the default translation model by passing the model argument to the pipeline.

Non-native languages

These models are trained to translate between English and some commonly used Indo-European languages. But what about a language like Lhasa Tibetan, or even an Indo-European language (like Punjabi or Pashto) that has many speakers but is held back by a lack of trained models?

Fret not! We can use a pre-trained multilingual model in these scenarios and fine-tune it to the desired language.
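The lesson does not name a particular multilingual model. As one illustration only, the sketch below assumes the facebook/nllb-200-distilled-600M checkpoint and its FLORES-200 style language codes; both the checkpoint and the codes are assumptions that should be verified against the model card. It shows how to try the pre-trained model as-is, which is a useful baseline before any fine-tuning. The names NLLB_CODES, nllb_kwargs, and build_multilingual_translator are hypothetical.

```python
# Language codes in the FLORES-200 style (assumed; check the model card).
NLLB_CODES = {
    "english": "eng_Latn",
    "punjabi": "pan_Guru",
    "pashto": "pbt_Arab",
    "tibetan": "bod_Tibt",
}


def nllb_kwargs(source, target, model="facebook/nllb-200-distilled-600M"):
    """Build the keyword arguments for a multilingual translation pipeline."""
    return {
        "task": "translation",
        "model": model,
        "src_lang": NLLB_CODES[source],
        "tgt_lang": NLLB_CODES[target],
    }


def build_multilingual_translator(source, target):
    """Create the pipeline; the first call downloads the model weights."""
    from transformers import pipeline

    return pipeline(**nllb_kwargs(source, target))
```

If the baseline quality is too low for the target language, the same checkpoint can serve as the starting point for fine-tuning on parallel text.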

Datasets

There are a couple of datasets available if we want to train or fine-tune a model of our own:

  • opus_books: This is a collection of (copyright-free) books translated into sixteen different languages.
  • code_x_glue_cc_code_to_code_trans: It provides some functions in Java and C# as a basic example of translation between programming languages.
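A minimal sketch of loading one language pair from opus_books with the datasets library follows. It assumes the en-fr configuration and a record layout of the form {"id": ..., "translation": {"en": ..., "fr": ...}}; to_pair and load_opus_books_pairs are hypothetical helpers.

```python
def to_pair(record, source="en", target="fr"):
    """Turn one dataset record into a (source_text, target_text) tuple."""
    translation = record["translation"]
    return translation[source], translation[target]


def load_opus_books_pairs(config="en-fr", limit=5):
    """Load the first few sentence pairs of one opus_books configuration.

    Requires the datasets package and downloads the data on first use.
    """
    from datasets import load_dataset

    books = load_dataset("opus_books", config, split="train")
    source, target = config.split("-")
    return [to_pair(books[i], source, target) for i in range(limit)]
```

Pairs in this shape are a convenient starting point for the tokenization step of a fine-tuning run.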

Examples

Let’s run some working examples to wrap it up:
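As one possible wrap-up, the sketch below translates several sentences in a single call. It assumes the translator is any callable that behaves like a Hugging Face translation pipeline (a list of strings in, a list of dictionaries with a translation_text key out); translate_batch is a hypothetical helper.

```python
def translate_batch(translator, sentences):
    """Translate a list of sentences and return the translated strings."""
    results = translator(list(sentences))
    return [item["translation_text"] for item in results]


# Example usage (downloads a model on first run):
# from transformers import pipeline
# translator = pipeline("translation_en_to_fr")
# print(translate_batch(translator, ["It's a pleasant day.", "Thank you."]))
```

Passing the translator in as an argument keeps the helper independent of any one model, so the same function works with the default pipeline or with a specific checkpoint.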
