Transformers and Transfer Learning

Let's discuss transformers and their impact on machine learning.

A milestone in NLP arrived in 2017 with the release of the research paper Attention Is All You Need by Vaswani et al. (Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. arXiv:1706.03762), which introduced a brand-new machine learning idea and architecture: transformers. Transformers are a fresh approach to sequential modeling tasks in NLP that targets some problems introduced by the long short-term memory (LSTM) architecture (recall the LSTM architecture from earlier). Here's how the paper explains how transformers work: "The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution."
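To make the quoted sentence concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the paper. It illustrates only the mechanism: in a real transformer, Q, K, and V are learned linear projections of the token embeddings, and several attention heads run in parallel.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # each output mixes all values

# Self-attention: queries, keys, and values all come from the same tokens.
X = np.random.rand(4, 8)  # 4 tokens, 8-dimensional embeddings
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```

Because every token attends to every other token in one matrix multiplication, no recurrence over time steps is needed.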

Transduction in this context means transforming input words into output words by first mapping the input words and sentences into vectors. Typically, a transformer is trained on a huge corpus, such as Wikipedia or a news corpus. Then, in our downstream tasks, we use these vectors, as they carry information about word semantics, sentence structure, and sentence semantics.
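For example, a few lines with the Hugging Face Transformers library are enough to turn a sentence into such vectors. This is a minimal sketch; bert-base-uncased is just an illustrative choice of pre-trained model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# bert-base-uncased is an illustrative pre-trained transformer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers map sentences into vectors.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (batch_size, sequence_length, 768)
print(outputs.last_hidden_state.shape)
```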

We already explored the idea of pre-trained word vectors earlier. Word vectors such as GloVe and fastText vectors are already trained on the Wikipedia corpus, and we used them directly for our semantic similarity calculations. In this way, we imported information about word semantics from the Wikipedia corpus into our semantic similarity calculations. Importing knowledge from pre-trained word vectors or pre-trained statistical models is called transfer learning.
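As a refresher, this is what that kind of transfer learning looked like in practice. The sketch below assumes spaCy's en_core_web_md model, which ships with pre-trained word vectors, is installed:

```python
import spacy

# Assumes: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")  # medium model with pre-trained vectors

doc1 = nlp("I visited England last summer.")
doc2 = nlp("I traveled to the United Kingdom in July.")

# The similarity score is driven entirely by knowledge imported
# from the corpus the word vectors were pre-trained on
print(doc1.similarity(doc2))
```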

The Hugging Face Transformers library offers thousands of pre-trained transformer models for NLP tasks, such as text classification, text summarization, question answering, machine translation, and natural language generation, in more than 100 languages. Its aim is to make state-of-the-art NLP accessible to everyone.
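For instance, loading a pre-trained model for a downstream task takes one line with the library's pipeline API. A minimal sketch; which default model gets downloaded on the first call depends on the library version:

```python
from transformers import pipeline

# Downloads a default pre-trained text classification model on first use
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make state-of-the-art NLP accessible."))
# Example output: [{'label': 'POSITIVE', 'score': 0.9998...}]
```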

To understand what's great about transformers, we'll first revisit the LSTM architecture. We have already stepped into the statistical modeling world with Keras and LSTMs. LSTMs are great for modeling text; however, they have some shortcomings too:

  • LSTM architecture sometimes has difficulty learning from long text. Long-range statistical dependencies are hard for an LSTM to represent because, as the time steps pass, the LSTM can forget some of the words that were processed at earlier time steps.

  • The nature of LSTMs is sequential: we process one word at each time step. This makes parallelizing the learning process impossible, and the lack of parallelization creates a performance bottleneck (see the Keras sketch after this list).
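Here is a minimal Keras sketch of the kind of LSTM text classifier we built earlier; the vocabulary and layer sizes are illustrative. The LSTM layer consumes the embedded tokens one time step at a time, which is exactly the sequential bottleneck described above:

```python
import tensorflow as tf

# Illustrative sizes: 10,000-word vocabulary, 64-dimensional embeddings
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),
    tf.keras.layers.LSTM(64),  # recurrence runs step by step over the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```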

Transformers address these problems by not using recurrent layers at all. As the following diagram shows, the architecture looks completely different from an LSTM architecture. A transformer consists of two parts: an input encoder block (called the Encoder) on the left and an output decoder block (called the Decoder) on the right. The following diagram is taken from the paper and exhibits the transformer architecture:

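In code, the same encoder-decoder split is easy to see: PyTorch ships the design from the paper as a built-in module. A minimal sketch using the paper's default hyperparameters (512-dimensional embeddings, 8 attention heads, 6 layers per block); note this uses PyTorch rather than the Keras we used for the LSTM:

```python
import torch
import torch.nn as nn

# nn.Transformer bundles the Encoder (left block) and Decoder (right block)
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source length, batch size, embedding dim)
tgt = torch.rand(20, 32, 512)  # (target length, batch size, embedding dim)

out = model(src, tgt)          # the decoder attends to the encoder's output
print(out.shape)               # torch.Size([20, 32, 512])
```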