Attention: Analyzing, Computing, and Implementing

Learn to analyze, compute, and implement the attention mechanism.

Analyzing the encoder states

Instead of relying only on the encoder’s last state, attention enables the decoder to access the complete history of the encoder’s state outputs. At every prediction step, the decoder computes a weighted average of all of these state outputs, with the weights depending on what it needs to produce at that step. For example, in the translation “I went to the shop” → “ich ging zum Laden,” when predicting the word “ging,” the decoder pays more attention to the first part of the English sentence than to the latter part.
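To make the weighted-average idea concrete, here is a minimal NumPy sketch of a single decoder step. The dot-product scoring function is an assumption made for brevity; classic NMT attention (e.g., Bahdanau-style) scores each encoder state with a small learned network instead, but the normalize-and-average step is the same.

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Compute a context vector as an attention-weighted average of
    all encoder state outputs.

    encoder_states: (seq_len, hidden) -- one state per source word
    decoder_state:  (hidden,)         -- the decoder's current state

    Dot-product scoring is a simplifying assumption; real systems
    typically use a learned scoring function.
    """
    # Score each encoder state against the current decoder state.
    scores = encoder_states @ decoder_state        # shape: (seq_len,)
    # Softmax: turn scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted average of encoder states -> the context vector.
    context = weights @ encoder_states             # shape: (hidden,)
    return context, weights

# Hypothetical example: a 5-word source sentence, hidden size 4.
enc = np.random.randn(5, 4)
dec = np.random.randn(4)
context, weights = attention_context(enc, dec)
print(weights)  # one weight per source word
```

The resulting context vector is recomputed at every decoding step, so the decoder is no longer forced to work from a single fixed summary of the source sentence, which is precisely the bottleneck discussed next.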

There have been many different implementations of attention over the years, but they all address the same need in NMT systems: the context, or thought vector, that sits between the encoder and the decoder is a performance bottleneck, because the entire source sentence must be compressed into that single fixed-length vector.
