Attention was born in order to address the limitations of Seq2Seq models.

The core idea is that the context vector $z$ should have access to all parts of the input sequence instead of just the last one.

In other words, we need to form a direct connection with each timestep.

This idea was originally proposed for computer vision: by looking at different parts of an image (glimpses), a model can learn to accumulate information about a shape and classify the image accordingly.

The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention” to the correct ones depending on the task at hand.

This is what we now call attention. Attention is simply a notion of memory gained from attending to multiple inputs through time.

Let’s see it in action.

Attention in the encoder-decoder example

In the encoder-decoder RNN case, given the previous decoder state $y_{i-1}$ and the encoder hidden states $h = [h_1, h_2, \dots, h_n]$, we have something like this:

$e_i = \text{attention}(y_{i-1}, h) \in \mathbb{R}^n$

The index $i$ indicates the prediction step. Essentially, we define a score (weighting) between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each hidden state $h_j$ (denoted by the index $j$) in $h_1, h_2, \dots, h_n$, we will calculate a scalar:

$e_{ij} = \text{attention}_{net}(y_{i-1}, h_j)$
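To make this concrete, here is a minimal sketch in PyTorch of one common choice for the scoring network, additive (Bahdanau-style) attention. The module and dimension names (`AdditiveAttention`, `dec_dim`, `enc_dim`, `attn_dim`) are illustrative assumptions, not part of the original text; in practice the scores $e_{ij}$ are normalized (e.g. with a softmax) before they weight the encoder states to form the context vector $z$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Sketch of a scoring network: e_ij = v^T tanh(W_y y_{i-1} + W_h h_j)."""

    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_y = nn.Linear(dec_dim, attn_dim, bias=False)   # projects decoder state
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)   # projects encoder states
        self.v = nn.Linear(attn_dim, 1, bias=False)           # collapses to a scalar score

    def forward(self, y_prev, h):
        # y_prev: (batch, dec_dim)     -- previous decoder state y_{i-1}
        # h:      (batch, n, enc_dim)  -- all encoder hidden states h_1, ..., h_n
        scores = self.v(torch.tanh(self.W_y(y_prev).unsqueeze(1) + self.W_h(h)))
        e_i = scores.squeeze(-1)                           # e_i in R^n, one scalar e_ij per h_j
        a_i = F.softmax(e_i, dim=-1)                       # normalized attention weights
        z_i = torch.bmm(a_i.unsqueeze(1), h).squeeze(1)    # context vector: weighted sum of h_j
        return e_i, a_i, z_i

# Usage example with made-up dimensions:
attn = AdditiveAttention(dec_dim=128, enc_dim=256, attn_dim=64)
y_prev = torch.randn(2, 128)        # batch of 2 previous decoder states
h = torch.randn(2, 10, 256)         # 10 encoder hidden states per example
e_i, a_i, z_i = attn(y_prev, h)     # scores (2, 10), weights (2, 10), context (2, 256)
```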

Visually, in our example, we have something like this:
