In this lesson, we’ll look at the dataset and at how to prepare it, both for training the NMT system and for generating predictions from it. First, we’ll discuss how to prepare the training data (that is, the pairs of source and target sentences). Then we’ll see how to feed a given source sentence to the trained system to produce its translation.

The dataset

The dataset we’ll be using is the WMT-14 English-German translation data. About 4.5 million sentence pairs are available, but we’ll use only 250,000 of them to keep the computation feasible. The vocabulary consists of the 50,000 most common English words and the 50,000 most common German words. Any word not found in the vocabulary is replaced with a special token, <unk>. We’ll need to download the following files:

train.de and train.en contain parallel sentences in German and English, respectively. Once we download these, we’ll load the sentences as follows:
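
A minimal sketch of the loading step, together with the <unk> replacement described above, might look like the following. It assumes that train.en and train.de are UTF-8 text files with one sentence per line and that tokens are separated by whitespace. The function names (load_sentences, build_vocab, replace_oov) and their parameters are illustrative and are not taken from the lesson’s own code.

```python
from collections import Counter
from itertools import islice


def load_sentences(path, n_sentences=250_000):
    """Read up to n_sentences lines from a UTF-8 file, one sentence per line.

    Blank lines are kept (as empty strings) so that line i of the English
    file still lines up with line i of the German file.
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in islice(f, n_sentences)]


def build_vocab(sentences, vocab_size=50_000):
    """Return the set of the vocab_size most frequent tokens.

    Assumption: tokens are whitespace-separated; a real pipeline may use
    a different tokenizer.
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(vocab_size)}


def replace_oov(sentence, vocab, unk="<unk>"):
    """Replace every token that is not in vocab with the unk token."""
    return " ".join(tok if tok in vocab else unk for tok in sentence.split())
```

With the files downloaded, calling load_sentences("train.en") and load_sentences("train.de") would return two lists of equal length in which entries at the same index form a sentence pair. Building one vocabulary per language and applying replace_oov to each sentence then yields the <unk>-filtered training data.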
