The name GPT-3 stands for “Generative Pre-trained Transformer 3.” Let’s go through all these terms individually to understand the making of GPT-3.

Generative models

GPT-3 is a generative model because it generates text. Generative modeling is a branch of statistical modeling. It is a method for mathematically approximating the world. We are surrounded by an incredible amount of easily accessible information—both in the physical and digital worlds. The tricky part is to develop intelligent models and algorithms that can analyze and understand this treasure trove of data. Generative models are one of the most promising approaches to achieving this goal.

To train a model, we must prepare and preprocess a dataset, a collection of examples that helps the model learn to perform a given task. Usually, a dataset is a large amount of data in some specific domain, like using millions of images of cars to teach a model what a car is. Datasets can also take the form of sentences or audio samples. Once we have shown the model many examples, we must train it to generate similar data.

Pre-trained models

Have you heard of the theory of 10,000 hours? In his book Outliers, Malcolm Gladwell suggested that practicing any skill for 10,000 hours is sufficient to make you an expert. This expert knowledge is reflected in the connections our human brains develop between their neurons. An AI model does something similar.

Training

To create a model that performs well, we need to find good values for its internal variables, called parameters. The process of determining the ideal parameter values for our model is called training: the model assimilates these values over successive training iterations.
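To make this concrete, here is a minimal sketch of a training loop for a toy linear model with made-up data; it illustrates the general idea of iteratively adjusting parameters, not how GPT-3 itself is trained:

```python
# A minimal, illustrative training loop: parameters start arbitrary and are
# adjusted over successive iterations until the model fits the data.
# Toy linear model and synthetic data; not GPT-3's actual training setup.
import torch

x = torch.randn(100, 1)
y = 3.0 * x + 1.0                        # the "world" the model tries to approximate

w = torch.zeros(1, requires_grad=True)   # the model's parameters
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for step in range(200):                  # successive training iterations
    loss = ((x * w + b - y) ** 2).mean() # how wrong the current parameters are
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # nudge the parameters toward better values

print(w.item(), b.item())                # approaches 3.0 and 1.0
```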

A deep learning model takes a lot of time to find these ideal parameters. Training is a lengthy process that, depending on the task, can last from a few hours to a few months and requires tremendous computing power. Reusing some of that long learning process for other tasks would help significantly, and this is where pre-trained models come in.

Model training process

In keeping with Gladwell’s 10,000-hours theory, a pre-trained model is like the first skill we develop, which helps us acquire another one faster. For example, mastering the craft of solving math problems can allow us to acquire the skill of solving engineering problems faster. A pre-trained model is trained (by us or someone else) for a more general task and can then be fine-tuned for different tasks. Instead of creating a brand new model to address our issue, we can take a pre-trained model that has already been trained on a more general problem and fine-tune it to our specific needs by providing additional training with a tailored dataset. This approach is faster and more efficient and allows for improved performance compared to building a model from scratch.
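As a rough sketch of that workflow, the example below loads a publicly available pre-trained language model (GPT-2, a predecessor of GPT-3) and runs a few extra training steps on a tiny, tailored text. The model choice, toy dataset, and hyperparameters are illustrative assumptions, not the procedure used for GPT-3:

```python
# A hedged sketch of fine-tuning: reuse pre-trained parameters and continue
# training on a small, task-specific dataset instead of starting from scratch.
# Model name, toy dataset, and hyperparameters are placeholder assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")      # already trained on general text

# A tiny tailored "dataset" for the specific task we care about.
texts = ["Q: What does GPT stand for? A: Generative Pre-trained Transformer."]
batch = tokenizer(texts, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):                                   # a few additional training steps
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```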

Training dataset

In machine learning, a model is trained on a dataset. The size and type of data samples vary depending on the task we want to solve. GPT-3 is pre-trained on a corpus of text from five datasets: Common Crawl, WebText2, Books1, Books2, and Wikipedia.

Common Crawl

The Common Crawl corpus comprises petabytes of data, including raw web page data, metadata, and text data collected over eight years of web crawling. OpenAI researchers use a curated, filtered version of this dataset.

WebText2

WebText2 is an expanded version of the WebText dataset, an internal OpenAI corpus created by scraping particularly high-quality web pages. To vet for quality, the authors scraped all outbound links from Reddit posts that received at least three karma (an indicator that other users found the link interesting, educational, or just funny). WebText contains 40 gigabytes of text drawn from these 45 million links, spanning over 8 million documents.

Books1 and Books2

Books1 and Books2 are two corpora, or collections of text, that contain the text of tens of thousands of books on various subjects.

Wikipedia

This dataset is a collection of all English-language articles from the crowdsourced online encyclopedia Wikipedia at the time of finalizing the GPT-3 dataset in 2019, roughly 5.8 million articles.

This corpus includes nearly a trillion words altogether.

Languages in datasets

GPT-3 is capable of generating and successfully working with languages other than English as well. The table below shows the top 10 languages in the dataset by number of documents.

Top 10 languages in the GPT-3 dataset

Language       Number of documents
English        235,987,420
German         3,014,597
French         2,568,341
Portuguese     1,608,428
Italian        1,456,350
Spanish        1,284,045
Dutch          934,788
Polish         632,959
Japanese       619,582
Danish         396,477

The gap between English and other languages is dramatic: English is number one, with 93% of the dataset, while German, at number two, accounts for just 1%. Still, that 1% is sufficient for GPT-3 to generate high-quality German text, handle style transfer, and perform other tasks. The same goes for the other languages on the list.

Since GPT-3 is pre-trained on an extensive and diverse corpus of text, it can successfully perform a surprising number of NLP tasks without users providing any additional example data.
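As a concrete illustration, a zero-shot request to GPT-3 looks roughly like the sketch below: the task is described in plain language, and no example data is supplied. The model name and the (legacy) Completion endpoint of the openai Python client are assumptions and may differ from the current API:

```python
# A hedged sketch of a zero-shot request: the task is stated in the prompt,
# with no example data provided. Model name and client version are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="text-davinci-003",    # a GPT-3-family model
    prompt=("Decide whether the sentiment of this review is positive or "
            "negative: 'I loved the soundtrack, but the plot was dull.'"),
    max_tokens=5,
)
print(response["choices"][0]["text"])
```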

Transformer models

Neural networks are at the heart of deep learning, with their name and structure being inspired by the human brain. They are composed of a network or circuit of neurons that work together. Advances in neural networks can enhance the performance of AI models on various tasks, leading AI scientists to continually develop new architectures for these networks. One such advancement is the transformer, a machine learning model that processes a sequence of text all at once rather than one word at a time and has a strong ability to understand the relationship between those words. This invention has dramatically impacted the field of natural language processing. Here is the architecture of the transformer-based Seq2Seq model:

Architecture of transformer-based Seq2Seq model

Sequence-to-sequence models

Researchers at Google and the University of Toronto introduced the transformer model in the 2017 paper “Attention Is All You Need”:

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

The foundation of transformer models is sequence-to-sequence architecture. Sequence-to-sequence (Seq2Seq) models are useful for converting a sequence of elements, such as words in a sentence, into another sequence, such as a sentence in a different language. This is particularly effective in translation tasks, where a sequence of words in one language is translated into a sequence of words in another language. Google Translate started using a Seq2Seq-based model in 2016.

An example of Seq2Seq machine translation

Seq2Seq models consist of two components: an Encoder and a Decoder. The Encoder can be thought of as a translator who speaks French as their first language and Korean as their second language. The Decoder is a translator who speaks English as their first language and Korean as their second language. To translate French to English, the Encoder converts the French sentence into Korean (also known as the context) and passes the context on to the Decoder. Since the Decoder understands Korean, it can translate the sentence from Korean into English. Working together, the Encoder and Decoder can successfully translate from French to English, as illustrated above.
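The sketch below mirrors this division of labor in code: an Encoder compresses the source sentence into a context vector, and a Decoder generates the target sentence from that context. It is a toy recurrent Seq2Seq model with made-up vocabulary and hidden sizes, not the architecture behind Google Translate or GPT-3:

```python
# A minimal Seq2Seq sketch: the Encoder turns the source sentence into a
# context vector (the "Korean" intermediate), and the Decoder generates the
# target sentence from that context. Sizes and data are toy assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src_tokens):
        outputs, context = self.rnn(self.embed(src_tokens))
        return context                              # the compressed "context"

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_tokens, context):
        outputs, _ = self.rnn(self.embed(tgt_tokens), context)
        return self.out(outputs)                    # scores over target-language words

encoder = Encoder(vocab_size=1000, hidden_size=64)
decoder = Decoder(vocab_size=1000, hidden_size=64)
src = torch.randint(0, 1000, (1, 5))                # a 5-token "French" sentence
tgt = torch.randint(0, 1000, (1, 6))                # a 6-token "English" sentence
logits = decoder(tgt, encoder(src))                 # shape: (1, 6, 1000)
```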

Transformer attention mechanism

The transformer architecture was invented to improve AI models’ performance on machine translation tasks. “Transformers started as language models,” Kilcher explains. “Not even that large, but then they became large.”

To use transformer models effectively, it is crucial to grasp the concept of attention. Attention mechanisms mimic how the human brain focuses on specific parts of an input sequence, using probabilities to determine which parts of the sequence are most relevant at each step.

An example of self-attention for a sentence

For example, look at the sentence, “The cat sat on the mat once it ate the mouse.” Does “it” in this sentence refer to “the cat” or “the mat”? The transformer model can strongly connect “it” with “the cat.” That’s attention.

As an example of how the Encoder and Decoder work together, imagine that the Encoder writes down important keywords related to the meaning of the sentence and provides them to the Decoder along with the translation. These keywords make the translation easier for the Decoder, because it now has a better sense of the critical parts of the sentence and the terms that provide context.

Types of attention

The transformer model has two types of attention: self-attention (the connection of words within a sentence) and Encoder-Decoder attention (the connection between words from the source sentence to words from the target sentence).

The attention mechanism helps the transformer filter out the noise and focus on what’s relevant: connecting two words that are in a semantic relationship with each other even though they carry no apparent markers pointing to one another.
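The core computation behind both kinds of attention is the same: compare queries against keys, turn the scores into probabilities, and take a weighted average of the values. The sketch below shows that computation in plain NumPy with toy word vectors; real transformers add learned projections, multiple heads, and other refinements on top of it:

```python
# Scaled dot-product attention in miniature. Toy vectors stand in for the
# words of "The cat sat on the mat once it ate the mouse" (11 tokens).
import numpy as np

def attention(queries, keys, values):
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return weights @ values, weights              # weighted mix of the values

source = np.random.randn(11, 16)                  # one vector per source word

# Self-attention: every word attends to every word in the same sentence,
# which is how "it" can put most of its weight on "cat".
self_out, self_weights = attention(source, source, source)

# Encoder-Decoder attention: target-side words (queries) attend to the
# source-side words (keys and values).
target = np.random.randn(7, 16)                   # one vector per target word
cross_out, cross_weights = attention(target, source, source)
```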

Transformer models benefit from larger architectures and larger quantities of data. Training on large datasets and fine-tuning for specific tasks improve results. Transformers understand the context of words in a sentence better than earlier kinds of neural networks. GPT uses only the Decoder part of the transformer architecture.
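To illustrate that last point, a decoder-only model such as GPT masks out future positions during self-attention, so each word can only look at the words before it, which is what lets it generate text one token at a time. The sketch below is a toy NumPy version of that causal mask, not OpenAI’s implementation:

```python
# Causal (masked) self-attention: each position may attend only to itself and
# earlier positions, as in GPT's decoder-only architecture. Toy example only.
import numpy as np

def causal_self_attention(x):
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # positions to the right
    scores = np.where(future, -1e9, scores)                  # block attention to the future
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ x

tokens = np.random.randn(5, 8)         # 5 toy token vectors
out = causal_self_attention(tokens)    # row i mixes information only from tokens 0..i
```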

Test your understanding

Match each sentence with the word that the instance of “it” most relates to. Options: Car, Street, Man, Municipal corporation, Goose.

1. “The man did not cross the street because it was too full.”
2. “The car needs fuel if it wants to go that far.”
3. “The street is closed by the municipal corporation due to the security incidents it is facing.”
4. “The goose did not go to the street because it was too scared.”