
Overview of Transformers

Learn how transformer models work and their role in conversational AI. Understand key concepts like tokenization, embeddings, attention mechanisms, and decoding to develop effective chatbots. Gain insight into selecting and deploying transformer-based large language models to create responsive, context-aware chatbot systems.

Transformer models in conversational AI

Over the last few decades, developments in the field of natural language processing (NLP) have culminated in large language models (LLMs) and, in particular, the introduction of transformers. Transformers were introduced in 2017 in the paper “Attention is All You Need” by Ashish Vaswani et al.

Transformers revolutionized the field of deep learning, offering a modern architecture that outperforms the recurrent neural networks (RNNs) and long short-term memory (LSTM) networks that were previously widely used. This architecture not only simplifies the structure of neural networks but also significantly reduces training time.

The evolution of NLP through time

Deep neural networks had already been in development for decades. RNNs (recurrent neural networks) were conceived in the 1990s, and LSTMs (long short-term memory networks) followed in 1997. The basic attention mechanism became popular in neural network architectures around 2014, and it helped improve the performance of various sequential models, including RNNs, LSTMs, and GRUs (gated recurrent units). The transformer model was introduced in the paper “Attention is All You Need” in 2017. BERT (Bidirectional Encoder Representations from Transformers) was released by researchers at Google in 2018 and became one of the first models to apply the transformer architecture to NLP tasks. Since 2018, transformer models have been widely adopted, with many adaptations and improvements; models such as GPT, T5, and others demonstrate the flexibility and effectiveness of the architecture. Since 2020, transformers have been used extensively in generative AI, with models such as GPT-3 showing remarkable capabilities for generating human-like text.

At a high level, transformers process text by tokenizing it. Tokenization is the process of converting text into smaller units, or tokens, such as words or sub-words. This step is crucial for transforming natural language into a format that the model can process. These tokens are then transformed into vector representations using word embedding tables, allowing the model to understand and generate text. Transformers power many applications that we use on a daily basis, such as text completion features in smartphone messaging apps (next-word prediction and auto-correction).
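
To make the tokenization and embedding steps concrete, here is a minimal sketch in Python. It uses a toy whitespace tokenizer and a random embedding table; the vocabulary, embedding dimension, and random values are illustrative assumptions, not taken from any real model (real systems use learned sub-word tokenizers and trained embedding tables).

```python
# Minimal sketch: toy tokenization and embedding lookup.
import numpy as np

sentence = "transformers process text as tokens"

# 1. Tokenization: split the text into smaller units (here, simple words).
tokens = sentence.split()

# 2. Map each token to an integer id using a small vocabulary.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
token_ids = [vocab[token] for token in tokens]

# 3. Embedding lookup: each id selects a row of a (vocab_size x d_model) table.
d_model = 8                                # embedding dimension (toy value)
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))
embeddings = embedding_table[token_ids]    # shape: (num_tokens, d_model)

print(tokens)
print(token_ids)
print(embeddings.shape)
```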

Device keyboard

Once the text is embedded, the attention mechanism within the transformer model processes and interprets the input data, offering a more nuanced understanding and text generation capability. Essentially, the attention mechanism allows the model to focus on different parts of the input data when generating each word in the output by paying attention to the most relevant word at each step of the sequence. This is achieved by calculating how much importance each word in the input sequence should receive relative to other words when predicting a specific word in the output. The self-attention mechanism utilizes sets of queries, keys, and values derived from the input data to perform this calculation. As a result, transformers can understand context and the relationships between words. This ability to allocate attention across the input sequence allows transformers to generate responses that enhance the quality of interaction in applications such as chatbots.

The output of the self-attention mechanism is then passed through a feed-forward neural network to process that data before contributing to the final output. In practical applications, such as when composing messages in a messaging app, a couple of words are suggested to the user. Under the hood, the sentence is sent to a neural network that predicts the next possible words with a probability vector, as shown below.

Neural network

This predictive capability, stemming from the transformer’s ability to weigh the context and relevance of each word in the sequence, allows for the generation of contextually relevant suggestions, enhancing the user experience.
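
As a small illustration of how such suggestions might be surfaced, the sketch below takes a probability vector over a handful of candidate next words and picks the top three, as a messaging keyboard might. The candidate words and probabilities are purely illustrative.

```python
# Minimal sketch: turning a next-word probability vector into suggestions.
import numpy as np

candidates = ["there", "you", "meeting", "later", "soon"]
probabilities = np.array([0.42, 0.25, 0.15, 0.10, 0.08])  # toy values, sum to 1

# Show the three most likely next words, highest probability first.
top3 = np.argsort(probabilities)[::-1][:3]
for idx in top3:
    print(candidates[idx], probabilities[idx])
```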

Understanding transformer architecture

Although the transformer architecture is less complex to understand than recurrent neural networks, it still consists of many blocks and layers, with each component comprising several more layers. Below is the famous transformer architecture:

Transformers architecture

To understand transformers, we need to separate their architecture into two major blocks: the encoder (on the left side of the preceding picture) and the decoder (on the right side).

The encoder

  1. The text is sent to the transformer model.

  2. The text is encoded using tokenization and embedding methods.

  3. Positional encoding is applied to the previous output vector to keep the order of the words in the sentence or paragraph.

  4. Self-attention using query, key, and value vectors is performed on the positionally encoded vectors. The dot product is taken between queries and keys to produce a score (refer to image 1 below), which is then scaled and passed through a softmax function to create attention weights (refer to image 2 below). The weights are used to create a weighted sum of the value vectors (a minimal code sketch of this computation follows the images below). Mathematically, this can be represented as:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  5. The output vectors from the self-attention mechanism are added to the original input vectors through a residual connection. A residual connection simply adds the input of a sub-layer (such as a self-attention or feedforward neural network layer) to its output, allowing the gradient to flow directly through the network. This mitigates the risk of vanishing gradients during back-propagation and enables deeper models to learn effectively. Layer normalization is then applied to this combined output.

  6. The normalized output vectors are passed onto a feedforward neural network (refer to image 3 below) to process the tokens, and an output vector for each token that represents a transformed feature space is produced. The output is then again added to the original input vectors through a residual connection. Layer normalization is then applied to this combined output.

Image 1: Multi-head attention model
Image 2: Scaled dot-product attention model
Image 3: Feedforward network
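
The following is a minimal sketch of scaled dot-product attention in NumPy, implementing the formula from step 4 above. In a real transformer, Q, K, and V come from learned projections of the token embeddings; here they are random toy matrices used only to show the computation.

```python
# Minimal sketch: scaled dot-product attention.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

output, attn_weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)                 # (4, 8)
print(attn_weights.sum(axis=-1))    # each row of weights sums to 1
```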

The decoder

  1. The output text from the previous step is fed into the decoder of the transformer model and is intentionally shifted right to ensure that the model predicts each subsequent token based only on the tokens that came before it. This right-shifted input is essential during training, enabling the model to learn accurate and context-aware token prediction by preventing it from seeing the future token it is tasked to predict.

  2. The text is encoded again using tokenization and embedding methods to ensure uniform representation with the encoder’s output. This process aligns the generated text’s representational space with that of the input, which is essential for coherent and context-aware text generation.

  3. Positional encoding provides the model with the context of each token’s position within the sequence by adding a unique vector to each token’s embedding, which represents its position in the sequence. This encoding is applied to the previous output vector to keep the order of the words in the sentence or paragraph.

  4. The decoder utilizes masked self-attention in a similar way to the encoder, but it masks future tokens. This forces the model to predict each token using only the preceding tokens and prevents it from attending to future positions in the sequence.

  5. The queries at this step come from the previous masked self-attention layer of the decoder, while the keys and values come from the output of the encoder. This encoder-decoder (cross) attention allows the decoder to attend to the entire input sequence. Attention using query, key, and value vectors is performed as before (refer to image 1 above): a dot product is taken between queries and keys to produce a score, which is then scaled and passed through a softmax function to create attention weights (refer to image 2 above), and the weights are used to create a weighted sum of the value vectors. The output vectors from the attention mechanism are added to the original input vectors through a residual connection. Layer normalization is then applied to this combined output.

  6. The normalized output vectors are passed onto a feedforward neural network (refer to image 3 above) to process the tokens, and an output vector for each token that represents a transformed feature space is produced. The output is then again added to the original input vectors through a residual connection. Layer normalization is then applied to this combined output.

  7. The output is passed through a linear transformation function to convert the decoder output to logits (scores for each possible next token). Mathematically, this can be represented as:

\text{logits} = W_o \cdot \text{output} + b

  8. Finally, a softmax layer is applied to generate probabilities for the next token in the sequence of words (refer to image 4 below; a minimal code sketch of these last two steps follows the image). Mathematically, this can be represented as:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Image 4: Linear and softmax transformation
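
Below is a minimal sketch of steps 7 and 8: a linear layer projects the decoder output for one position to vocabulary logits, and a softmax converts the logits into next-token probabilities. The projection matrix, bias, and vocabulary size are toy values, not taken from any particular model.

```python
# Minimal sketch: final linear projection to logits, then softmax.
import numpy as np

d_model, vocab_size = 8, 10
rng = np.random.default_rng(0)

decoder_output = rng.normal(size=(d_model,))   # decoder output for one position
W_o = rng.normal(size=(vocab_size, d_model))   # output projection weights
b = np.zeros(vocab_size)                       # output projection bias

logits = W_o @ decoder_output + b              # logits = W_o . output + b

# Softmax: exponentiate and normalize so the scores form a probability vector.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

next_token_id = int(np.argmax(probs))          # greedy choice of next token
print(probs.round(3), next_token_id)
```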

Leveraging transformers in chatbot development

The most critical step in utilizing transformers for chatbot development is selecting the appropriate model. With a large selection of models available, such as GPT (Generative Pre-trained Transformer), BERT (Bidirectional Encoder Representations from Transformers), and T5 (Text-to-Text Transfer Transformer), the choice depends on the specific requirements of the chatbot. For example, Falcon, Mistral, and Llama2, which are transformer-based LLMs, are ideal for creating chatbots that require high levels of conversational fluency. BERT, with its deep understanding of context and language nuances, is suitable for chatbots focusing on answering specific simple queries accurately, quickly, and efficiently. T5, a model designed for a wide range of NLP tasks, including translation, summarization, and question answering, thanks to its text-to-text approach, is ideal for chatbots requiring a variety of linguistic transformations.
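
Once a model is chosen, it can typically be loaded and queried in a few lines. Here is a minimal sketch using the Hugging Face transformers library (the platform the later lessons rely on); it assumes the library is installed and the model is available on the Hub, and "gpt2" is used only as a small, widely available placeholder rather than a recommendation for production chatbots.

```python
# Minimal sketch: loading a transformer model for chatbot-style text generation.
from transformers import pipeline

# Build a text-generation pipeline around a placeholder model.
generator = pipeline("text-generation", model="gpt2")

prompt = "User: What are transformers used for?\nAssistant:"
response = generator(prompt, max_new_tokens=40, do_sample=True)

print(response[0]["generated_text"])
```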

Developers need to consider several factors:

  • Language understanding and generation: An important consideration is how well a model comprehends and generates language. This capability determines the chatbot’s effectiveness in interpreting user inputs and producing coherent responses, which directly impacts the user experience.

  • Computational resources: The computational demands of different models vary considerably. While a developer might be able to train a 7-billion-parameter Llama model on a PC, a 180-billion-parameter Falcon model would need multiple powerful graphics processing units (GPUs) to train. This distinction highlights the practical considerations that developers need to account for when balancing a model’s capabilities against the available computational infrastructure.

  • Definitions: The following factors play a crucial role in understanding and optimizing a model’s performance:

    • Parameters: These refer to the number of trainable weights in a model, which is different from the size of the vocabulary or tokens that a model was trained on.

    • Trainable weights: These are the components within the neural network that are optimized through learning from data. They are adjusted to minimize the model’s loss function, allowing the model to make accurate predictions.

    • Loss function of a model: This is a mathematical formula that measures the difference between the model’s predicted output and the actual target values for a set of data. It quantifies the model’s error or loss, providing a metric to evaluate how well the model is performing.
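
As a small illustration of the last definition, here is a sketch of one common loss function, mean squared error, computed on toy predicted and actual values; the numbers are purely illustrative.

```python
# Minimal sketch: mean squared error between predictions and targets.
import numpy as np

predicted = np.array([2.5, 0.0, 2.1, 7.8])   # model outputs (toy values)
actual    = np.array([3.0, -0.5, 2.0, 7.5])  # target values (toy values)

mse = np.mean((predicted - actual) ** 2)     # average of the squared errors
print(mse)                                   # 0.15
```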

Deploying the chatbot in a real-world scenario

The final step for any LLM-powered chatbot is deployment in a real-world scenario, making it accessible to users through an online platform, such as a website, mobile app, or social media channel. Deployment involves integrating the chatbot with existing infrastructure or creating new infrastructure, ensuring it can handle multiple user queries in real time. This stage requires setting up monitoring and logging systems to track the chatbot’s performance, identify issues, and collect user feedback for continuous improvement. After deployment, ongoing maintenance is essential for updating or augmenting the model with new data to enhance its responses and adapt to changes in user behavior or incoming data. Deploying a chatbot successfully requires careful planning, robust infrastructure, and a commitment to continuous learning and adaptation.

Our case study outlines the comprehensive process of leveraging transformer models to develop and deploy a chatbot. From choosing the right model and fine-tuning it with domain-specific data to deploying and maintaining the chatbot in a real-world environment, each step is essential for creating a responsive, efficient, and user-friendly chatbot that meets the specific needs of its intended target audience.

Note: Please be mindful that the LLMs used in all the notebooks in the following lessons are hosted on the Hugging Face platform, which updates and manages its model repository. On rare occasions, specific models might be taken down for various reasons, such as policy changes by model contributors. If you encounter an error indicating that a model is not available, visit the Hugging Face Model Hub at https://huggingface.co/models, look for an alternative model that suits the same task, and replace the unavailable model with it.

In addition, the examples and outputs generated from the LLMs in all the following lessons are produced using basic language models, which, while generally accurate, may sometimes repeat information or produce less nuanced content.