Transformer Networks

Learn about sequence-to-sequence (seq2seq) modeling and Transformer networks.

Sequence-to-sequence (seq2seq) modeling

Recurrent architectures such as RNNs have long dominated seq2seq modeling. These architectures process sequences, such as text, iteratively (i.e., one element at a time and in order). This sequential processing makes it difficult to learn long-range dependencies because of issues such as vanishing gradients. As the gap between relevant tokens grows, these models tend to lose information from early time steps, resulting in an incomplete understanding of context, which is essential for language modeling.
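To make the sequential nature concrete, here is a minimal, illustrative sketch of an RNN-style loop over token embeddings. The tokenization, embedding values, hidden size, and weights below are assumptions made for illustration rather than part of any particular library; the point is simply that each hidden state depends on the previous one, so time steps cannot be processed in parallel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the example sentence is split on whitespace and each
# token is mapped to a random 16-dimensional embedding (a real model would
# use learned embeddings).
tokens = "The cat that the dog chased ran up a tree".split()
d_model, d_hidden = 16, 32
embeddings = rng.normal(size=(len(tokens), d_model))

# Randomly initialized RNN weights, for illustration only.
W_xh = rng.normal(size=(d_model, d_hidden)) * 0.1
W_hh = rng.normal(size=(d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

# Sequential processing: the hidden state at step t depends on step t-1,
# so the loop cannot be parallelized across time steps.
h = np.zeros(d_hidden)
for t, x_t in enumerate(embeddings):
    h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    print(f"step {t}: processed {tokens[t]!r}")
```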

Let’s take a look at an example: “The cat that the dog chased ran up a tree.” This sentence contains a long-range dependency between an earlier word (“cat”) and a later word (“ran”). An RNN processes the sentence iteratively (i.e., token by token) and must learn this dependency across the intervening words. In practice, it may fail to relate “cat” to “ran” because several words separate them.

To solve this problem, what if we designed a model that processes the entire sequence “The cat that the dog chased ran up a tree” in parallel and captures the relationship between every pair of tokens simultaneously? This is precisely what the Transformer model does. It models long-range dependencies across the entire sequence using the self-attention mechanism, computing the relationship between all pairs of tokens via dot-product attention.
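The sketch below illustrates this idea with plain dot-product scores between token embeddings. The random embeddings are an assumption for illustration, and the learned query/key/value projections of a real Transformer are omitted; the sketch only shows that all pairwise relationships are computed at once, in a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token embeddings for the example sentence (random values; a
# real model would use learned embeddings plus positional information).
tokens = "The cat that the dog chased ran up a tree".split()
d_model = 16
X = rng.normal(size=(len(tokens), d_model))  # one row per token

# Dot-product scores between every pair of tokens, computed in one matrix
# multiplication rather than a step-by-step loop.
scores = X @ X.T / np.sqrt(d_model)          # shape: (10, 10)

# A softmax over each row turns the scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The weight linking "cat" and "ran" is available directly, no matter how
# many words separate them in the sentence.
print(weights[tokens.index("cat"), tokens.index("ran")])
```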

Transformers

The transformer architecture was introduced in the paper “Attention Is All You Need” (Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention is all you need.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. Red Hook, NY: Curran Associates Inc.). It represents a departure from the sequential processing paradigm of earlier models such as RNNs and CNNs. The transformer relies on self-attention, a mechanism that allows the model to weigh the importance of different parts of the input when making predictions. The self-attention mechanism computes attention scores between all elements of an input sequence $X$. Consider an input sequence represented as a set of vectors, $X = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the embedding of the $i$-th token.
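For reference, the scores come from the scaled dot-product attention defined in the paper. The formulation below follows the paper’s notation: the input vectors are packed into the matrix $X$, the projection matrices $W^{Q}$, $W^{K}$, and $W^{V}$ are learned parameters, and $d_k$ is the dimensionality of the key vectors.

```latex
% Queries, keys, and values are linear projections of the input sequence X.
Q = XW^{Q}, \qquad K = XW^{K}, \qquad V = XW^{V}

% QK^T scores every pair of tokens in a single matrix product; the row-wise
% softmax turns the scaled scores into attention weights over the values.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```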
