Introduction to GenAI
Explore generative AI fundamentals including how transformer architectures work, the role of large language and multimodal models, and the function of diffusion models. Understand the technology behind content creation by AI and its impact across industries, preparing you to navigate and utilize generative AI effectively.
Generative artificial intelligence (AI) enables machines to create new content, such as images, text, or music, rather than just analyzing existing data. For instance, imagine a system that can generate lifelike artwork in seconds or write a personalized email draft based on minimal input. This groundbreaking technology is already transforming industries like healthcare, where it helps design new drugs, and entertainment, where it creates realistic visual effects. By bridging creativity and computation, Generative AI is reshaping our thinking about innovation and automation.
Generative Artificial Intelligence (AI) refers to the ability of machines to generate new content rather than simply performing recognition, detection, or prediction tasks in existing data. This revolutionary technology can potentially transform industries like healthcare, education, entertainment, and marketing. Generative AI’s applications range from generating realistic images and videos to creating coherent text and speech.
How does generative AI work?
Generative AI works by training large models based on neural networks on vast amounts of data to learn patterns, relationships, and structures within the data. During training, the model adjusts its internal parameters to minimize the difference between its output and the actual data, effectively “learning” how to generate outputs that are contextually relevant to the input.
Here’s a breakdown of the process:
Data training: Generative AI models are trained on large datasets, such as text data for language models or images for visual models. These datasets help the model recognize patterns, syntax, semantics, and even style.
Fine-tuning: After this initial training, the model can be fine-tuned on more specific datasets if needed. This step adjusts the model’s parameters to improve its performance in a particular domain or specific tasks (like medical text generation or creative writing), making it more precise or aligned with specialized use cases.
Generation: Once trained, the model uses probabilities to generate content by predicting the most likely next item in a sequence—like the next word in a sentence or the next pixel in an image. It does this iteratively, building outputs step-by-step based on prior steps.
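To make this concrete, here is a minimal, illustrative sketch of the generation step in Python. The toy vocabulary and the probabilities are invented purely for illustration; a real model would compute these probabilities with a trained neural network.

```python
import random

# Toy "model": given the tokens generated so far, return a probability for each
# candidate next token. A real generative model computes these probabilities
# with a trained neural network; the numbers here are made up for illustration.
def next_token_probs(tokens):
    if tokens[-1] == "The":
        return {"cat": 0.6, "dog": 0.3, "weather": 0.1}
    if tokens[-1] in ("cat", "dog"):
        return {"sat": 0.5, "ran": 0.4, ".": 0.1}
    return {".": 1.0}

def generate(prompt, max_new_tokens=5):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Sample the next token according to the predicted probabilities,
        # append it, and repeat: generation is iterative, step by step.
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_token)
        if next_token == ".":
            break
    return " ".join(tokens)

print(generate("The"))  # e.g., "The cat sat ."
```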
Through these steps, generative AI can produce new content that resembles the original data it was trained on, making it powerful for creative and functional tasks across different fields. The workings and use cases of different generative AI models often depend on their architecture. Let’s analyze the foundation models of generative AI so that we can build up an understanding of how they work:
Foundation models
Foundation models serve as the base for various applications. These models are characterized by their large scale, pretraining, and adaptability. Foundation models are trained on vast amounts of data, enabling them to learn complex patterns and relationships. This initial training is often followed by fine-tuning for specific tasks, allowing the models to adapt to various domains. Several categories fall under the umbrella of foundation models:
Transformer-based models
Transformer-based models are built on the transformer neural network architecture used in many foundation models. Introduced in 2017, the transformer is a type of deep learning architecture that has revolutionized the field of generative AI and serves as the building block for many advanced AI models, including LLMs. Earlier models, like RNNs, had trouble remembering context while processing long inputs. Transformers eliminated this issue by introducing the self-attention mechanism, which lets them retain context over a long range of inputs and use it to generate content relevant to the overall context.
Architecture
Transformers have two main parts: an encoder and a decoder. Here is a breakdown of how each of these works:
Encoder
The encoder processes the input sequence and creates a contextual representation, capturing relationships and dependencies among the tokens. Here’s how data progresses through each component of the encoder:
Input embeddings: The process begins with the input data (such as text tokens in NLP tasks). Each token (word, part of a word, or other unit) is converted into a dense vector representation known as an embedding. These embeddings capture semantic meaning and are passed on to the next layer. In addition, positional encoding is added to these embeddings to retain information about the relative or absolute position of tokens in the sequence.
Positional encoding: Transformers process the entire data sequence simultaneously (in parallel), so they need a mechanism to understand the order of the tokens in the sequence. Positional encodings are added to the token embeddings. These encodings provide information about the position of each token in the sequence, helping the model differentiate between tokens in different positions. The result is a sequence of vectors containing each token’s content and position.
Multi-head attention: Transformers use multiple self-attention heads that learn different aspects of the relationships between tokens. The self-attention mechanism is the core of the transformer architecture. It allows the model to weigh the importance of different tokens relative to each other in the sequence.
Query, Key, and Value (QKV): For each token, three vectors, Query (Q), Key (K), and Value (V), are computed using weight matrices learned during training.
The Query represents the token in question.
The Key represents each token’s relevance to others.
The Value holds the information to be passed along.
The attention score is calculated by comparing the Query of one token to the Keys of all other tokens. This determines how much focus (or attention) the model should give to other tokens in the sequence when processing a particular token.
These scores are normalized using softmax and then used to compute a weighted sum of the Values. The result is a context-aware representation of the input token that incorporates information from the other tokens in the sequence. A small code sketch of this computation follows the encoder components below.
Each head performs the self-attention operation independently in parallel. The outputs of these multiple attention heads are then concatenated and passed through a linear transformation, which results in a richer, more nuanced representation of the input.
Feedforward neural network (FFN): After the multi-head attention mechanism, the output is passed through a feedforward neural network. This is typically a two-layer, fully connected network with an activation function (e.g., ReLU) applied in between. This layer helps transform the data further, refining the representation learned from the multi-head attention layer.
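The attention computation described above can be sketched in a few lines of Python. This is a minimal, single-head illustration with random weights rather than a trained model; in a real transformer, several such heads run in parallel, and the projection matrices are learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head over a sequence of token embeddings X (seq_len x d_model)."""
    Q = X @ Wq                          # Queries: what each token is looking for
    K = X @ Wk                          # Keys: what each token offers for matching
    V = X @ Wv                          # Values: the information each token passes along
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare every Query with every Key
    weights = softmax(scores, axis=-1)  # normalized attention scores
    return weights @ V                  # weighted sum of Values, one vector per token

# Illustrative shapes only: 4 tokens, model dimension 8, head dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 4): one context-aware vector per token
```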
Decoder
The decoder takes the encoder’s output and generates the target sequence, such as a translated sentence, a continuation of text, or the pixels of a high-resolution image reconstructed from a low-resolution input. The decoder has similar layers to the encoder, but with some key differences for generating output sequentially:
Input embedding and positional encoding: Like the encoder, the decoder also embeds each token in the target sequence and adds positional encodings to retain the order of tokens.
Masked self-attention mechanism: In the decoder, self-attention is masked to prevent access to future tokens during training, ensuring that predictions are generated one step at a time. Attention scores are calculated as in the encoder, but only over past and current tokens. This setup mirrors real-world generation, where the model only knows the preceding words when producing the next one; if future tokens were visible during training, the model could rely on them and then perform poorly at generation time, when they aren’t available. A short sketch of this causal masking follows the decoder components below.
Multi-head attention (context attention): Here, the encoder-decoder attention layer (or cross-attention) takes the encoder’s output as keys and values and combines it with the decoder’s query to focus on relevant parts of the input sequence. This layer enables the decoder to use the encoder’s context information, ensuring coherence with the input.
Feedforward neural network (FFN): Similar to the encoder, the decoder has a two-layer feedforward neural network to process and refine the context-aware representation from the multi-head attention output.
Output layer: Finally, the decoder's output is passed through a linear layer and a softmax function to predict the next token in the sequence. The decoder generates one token at a time until it reaches an end-of-sequence token, completing the output.
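The causal masking used in the decoder’s self-attention can be sketched as follows. This is a simplified illustration with random vectors; the key point is that each position’s attention scores over future positions are set to negative infinity, so the softmax assigns them zero weight.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular mask: position i may attend only to positions <= i.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores + causal_mask(scores.shape[0])            # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over visible tokens
    return weights @ V

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(5, 4))   # 5 target-sequence positions
out = masked_self_attention(Q, K, V)
print(out.shape)  # (5, 4): each position is built only from itself and earlier positions
```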
By processing data through this structured encoder-decoder pipeline, transformers can generate high-quality outputs considering the full context of both input and output sequences. This architecture is the basis for many advanced generative AI models like LLMs, allowing them to excel in various complex tasks.
Large language models
LLMs are essentially applications of the transformer architecture, scaled up, trained on far more data, and specialized for natural language processing, which makes them particularly effective for complex language tasks like language translation, text summarization, code generation, and creative writing. Some key characteristics of LLMs are listed below:
Scale and scope: LLMs are very large-scale implementations of transformers, typically trained with billions (or even trillions) of parameters on extensive text datasets. This scale enables LLMs to capture refined language patterns and general knowledge, making them adept at writing, translation, and reasoning tasks.
Training and adaptation: LLMs undergo extensive pretraining on diverse datasets, allowing them to build a structural understanding of language. They can then be fine-tuned for various downstream tasks (e.g., summarization, translation) and are often optimized to perform many tasks without additional training.
General-purpose capability: LLMs are designed to be general-purpose models capable of understanding and generating human-like text across different tasks and domains. Due to their comprehensive pretraining, LLMs like ChatGPT can handle diverse language tasks out of the box.
LLMs highlight transformers’ incredible adaptability, showcasing their relevance in NLP.
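As a quick, hedged illustration of this out-of-the-box capability, the snippet below assumes the Hugging Face transformers library is installed; the model name "gpt2" is used only as a small, readily available example.

```python
from transformers import pipeline

# Load a small pretrained language model behind a text-generation pipeline.
generator = pipeline("text-generation", model="gpt2")

# The same pretrained model handles arbitrary prompts without task-specific training;
# fine-tuning would further specialize it for a particular domain.
result = generator("Generative AI is transforming industries because", max_new_tokens=30)
print(result[0]["generated_text"])
```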
Multimodal models
Multimodal models expand on LLMs by incorporating multiple data types, such as text and images or text and audio, and fusing them. These models leverage the transformer architecture because it can handle large amounts of data and capture complex relationships within and across different modalities, enabling tasks like image-text matching or generating realistic images from text prompts.
Multimodal models use different architectural components depending on the task they need to perform:
Transformer-based architectures: Most multimodal models use transformer-based architectures as the backbone because of their flexibility and scalability, for example, Vision Transformers (ViT) for processing images and Text transformers like BERT, GPT, or T5 for processing text. These transformers process each modality separately at first (e.g., one transformer for images, another for text) before integrating their outputs.
Modality-specific encoders: Each modality (text, image, or audio) has its own encoder. The encoder for text might be based on models like BERT or GPT, while the image encoder might be a Vision Transformer (ViT) or a Convolutional Neural Network (CNN) like ResNet. These encoders transform raw input (text, image, or other types) into a feature representation that captures the underlying structure of the data.
Fusion mechanisms: After encoding, the representations from each modality need to be fused or combined. This can be done at various stages:
Early fusion: Raw data from each modality is combined and fed into the model at the input level before the separate modality-specific encoders are applied.
Late fusion: Modalities are processed independently, and their representations are combined later in the model (after feature extraction). This approach is used in many multimodal models.
Hybrid fusion: A combination of early and late fusion, where certain features from different modalities are merged at various points during processing, an example of which is cross-modal attention.
Cross-modal attention: Some multimodal models use cross-modal attention mechanisms to allow information from one modality to influence the processing of another modality. For example, a model processing text and images may use attention mechanisms to allow textual information to guide how the image features are interpreted and vice versa. This cross-modal attention helps understand the relationships between the different data types and ensures that relevant features from each modality are properly integrated.
Joint embedding space: A common approach is mapping the different modalities (text, image, etc.) into a shared latent or joint embedding space. This means that features from different modalities are encoded into a common vector space, where they can be more easily compared and combined. This allows the model to learn multimodal representations that integrate features from multiple sources (e.g., associating a caption with an image or correlating a spoken question with a visual answer).
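A minimal sketch of a joint embedding space is shown below. The encoders here are random projections standing in for a real text transformer and image encoder, purely to illustrate how both modalities end up as vectors in one shared space where they can be compared.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 16  # size of the shared embedding space

# Stand-in encoders: a real multimodal model would use a trained text
# transformer and a trained image encoder (e.g., a ViT); these random
# projections only illustrate the idea of a shared vector space.
def encode_text(text):
    return rng.normal(size=DIM)

def encode_image(image):
    return rng.normal(size=DIM)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

caption_vec = encode_text("a cat sitting on a sofa")
image_vec = encode_image("cat_photo.jpg")  # placeholder for real pixel data

# After training, a matching caption and image should score higher here
# than a mismatched pair; with random encoders the score is meaningless.
print(cosine_similarity(caption_vec, image_vec))
```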
Diffusion models
Diffusion models are a distinct generative approach that iteratively refines noise signals until they produce realistic data samples. They are used in machine learning to create complex data, such as images, audio, and more, by modeling the process of gradual noise removal. These models have recently gained popularity due to their ability to generate high-quality, diverse, and detailed outputs, and they are often used in applications like image synthesis.
The diffusion process involves a series of steps, each refining the input noise. The noise schedule controls the progression of noise levels throughout the diffusion process.
Starting with clean data: The model is given clean data (e.g., an image) as its initial input.
Adding noise: During training, the model gradually adds noise to this clean data across multiple time steps until it eventually reaches complete noise. This process is governed by the noise schedule, which controls how much noise to add at each step.
Learning to reverse: For each time step, the model learns to reverse this process by predicting and removing the noise to return to a clean state. After this, the model can effectively generate new data from scratch, guided by the learned patterns, by applying this reverse process to random noise.
Sampling: After training, the model can start with pure noise and apply its learned reverse process to generate new, original data, such as an image that did not exist before.
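The forward (noising) direction and the idea behind the reverse step can be sketched numerically. This is a simplified toy under loose assumptions: a 1-D signal stands in for an image, and the "predicted" noise is simply the true noise that was added, so the reverse step recovers the clean data exactly. A real diffusion model would train a neural network to predict that noise.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10                                # number of diffusion steps
betas = np.linspace(1e-4, 0.2, T)     # noise schedule: how much noise per step
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention factor

def add_noise(x0, t):
    """Forward process: produce the noisy version of clean data x0 at step t."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

# Clean "data": a tiny 1-D signal standing in for an image.
x0 = np.sin(np.linspace(0, np.pi, 8))
x_noisy, true_noise = add_noise(x0, t=T - 1)   # nearly pure noise by the last step

# Reverse direction (conceptual): a trained network would predict the noise so it
# can be removed; here we reuse the true noise just to show the computation.
x_recovered = (x_noisy - np.sqrt(1.0 - alpha_bar[T - 1]) * true_noise) / np.sqrt(alpha_bar[T - 1])
print(np.allclose(x_recovered, x0))  # True: removing the predicted noise recovers the data
```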
A common example of such architectures is the U-Net, which downsamples an image through several steps in its encoder and then upsamples and rebuilds the image through the same number of steps in its decoder, using skip connections that carry information from every step of the encoder’s downsampling.
Conclusion
Generative AI has the potential to revolutionize various industries. Understanding foundation models, transformer-based models, LLMs, multimodal models, and diffusion models provides a solid base for exploring this exciting field. As generative AI continues to evolve, its applications will become increasingly diverse, transforming how we interact with technology.