Introduction to Deep Learning & Neural Networks/

...

Self-Attention

Understand in depth how self-attention works.

We'll cover the following...

What is self-attention?

“Self-attention" is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” ~ Ashish Vaswani et al. from Google Brain

Self-attention enables us to find correlations between different words (tokens) of the input indicating the syntactic and contextual structure of the sentence.

Let’s take the input sequence “Hello I love you” as an example.

A trained self-attention layer will associate the word “love” with the words “I” and “you” with a higher weight than the word “Hello”. From linguistics, we know that these words share a subject-verb-object relationship and that is an intuitive way to understand what self-attention will capture.

Self attention probability score matrix

In practice, the original transformer model uses three different representations of the embedding matrix: the Queries, Keys, and Values.

This can easily be implemented by multiplying our input ${X} \in R^{N \times d_{k} }$ with three different weight matrices ${W}_Q$ , ${W}_K$ and ${W}_V \in R^{ d_{k} \times d_{model}}$ ...

Learn Deep Learning

Neural Networks

Training Neural Networks

Convolutional Neural Networks

Recurrent Neural Networks

Autoencoders

Generative Adversarial Networks

Attention and Transformers

Graph Neural Networks

Conclusion

Final Quiz

Self-Attention