Large Language Models

Get introduced to language models and large language models.


Let’s imagine a conversation with a friend, where the friend starts a sentence with: “I’m going to make a cup of ________.” Humans would likely predict that the next word could be coffee or tea based on their knowledge of common beverage choices.

Similarly, a language model is trained to understand and predict the next word in a sequence based on the context of the preceding words. It learns from vast amounts of text data and can make informed predictions about what word will likely come next in a given context.

Before going into more detail, let’s first discuss what language models are.

Language models

A language model (LM) can be defined as a probabilistic model that assigns probabilities to sequences of words or tokens in a given language. The goal is to capture the structure and patterns of the language to predict the likelihood of a particular sequence of words.

Let’s assume we have a vocabulary $V$ and a sequence of words or tokens drawn from it, denoted as $w_1, w_2, \dots, w_n$, where $n$ is the length of the sequence. The LM assigns a probability $p$ to every possible sequence of words from the vocabulary $V$.

The probability of the entire sequence can be expressed as follows:

$$p(w_1, w_2, \dots, w_n)$$

Assume we have $V=\{\text{chase, the, cat, the, mouse}\}$, and the following probabilities $p$ assigned:

[Figure: probabilities assigned to sequences of words from V]

Note: A language model needs knowledge of the language to assign meaningful probabilities, and it acquires this knowledge through training. During training, the model learns to assign higher probabilities to words that are more likely to follow a given context. After training, the language model can generate text by sampling words based on these learned probabilities.


We can also predict a word given a sequence. A language model estimates the probability of a sequence by considering the conditional probability of each word given the previous words in the sequence. Using the chain rule of probability, the joint probability of the sequence can be decomposed as:

$$p(w_1, w_2, \dots, w_n) = p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \dots, w_{n-1})$$

For example:

$$p(\text{the, cat, chase, the, mouse}) = p(\text{the}) \cdot p(\text{cat} \mid \text{the}) \cdot p(\text{chase} \mid \text{the, cat}) \cdot p(\text{the} \mid \text{the, cat, chase}) \cdot p(\text{mouse} \mid \text{the, cat, chase, the})$$
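
The chain-rule decomposition can be sketched in a few lines of Python. The conditional probabilities below are hypothetical values chosen only for illustration (they do not come from a trained model), and the helper name sequence_probability is likewise an assumption.

```python
import math

# Hypothetical conditional probabilities p(word | preceding words).
# These numbers are made up for illustration only.
conditional_probs = {
    ("the", ()): 0.5,
    ("cat", ("the",)): 0.2,
    ("chase", ("the", "cat")): 0.1,
    ("the", ("the", "cat", "chase")): 0.4,
    ("mouse", ("the", "cat", "chase", "the")): 0.25,
}


def sequence_probability(tokens, probs):
    """Multiply p(w_i | w_1, ..., w_{i-1}) over the sequence (chain rule)."""
    log_prob = 0.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[:i])
        p = probs.get((word, context), 0.0)
        if p == 0.0:
            return 0.0  # one impossible step makes the whole sequence impossible
        log_prob += math.log(p)  # summing logs avoids underflow on long sequences
    return math.exp(log_prob)


sentence = ["the", "cat", "chase", "the", "mouse"]
joint = sequence_probability(sentence, conditional_probs)
print(joint)
```

Working in log space is a common design choice here: multiplying many small probabilities directly can underflow, whereas adding their logarithms stays numerically stable.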

In practice, modeling these conditional probabilities accurately is a complex task. Modern language models, such as Transformer-based models like GPT-3, utilize deep learning techniques to capture intricate patterns and dependencies in the data.

N-gram language model

N-gram models are a type of probabilistic language model used in natural language processing and computational linguistics. These models are based on the idea that the probability of a word depends on the previous $n-1$ words in the sequence. The term “n-gram” refers to a consecutive sequence of $n$ items.

For example, consider the following sentence: I love language models.

  • Unigram (1-gram): “I,” “love,” “language,” “models”

  • Bigram (2-gram): “I love,” “love language,” “language models”

  • Trigram (3-gram): “I love language,” “love language models”

  • 4-gram: “I love language models”
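
The n-grams listed above can be reproduced with a short Python sketch. The function name extract_ngrams is a hypothetical helper introduced here for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a simplification of real tokenization.
tokens = "I love language models".split()

for n in range(1, 5):
    print(n, extract_ngrams(tokens, n))
```

A sentence of four tokens yields four unigrams, three bigrams, two trigrams, and a single 4-gram, which matches the list above.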

N-gram models are simple and computationally efficient, making them suitable for various natural language processing tasks. However, their limitations include the inability to capture long-range dependencies in language and the sparsity problem when dealing with higher-order n-grams.

Note: Neural language models, such as those based on recurrent neural networks (RNNs), later improved on n-gram models and have themselves largely been replaced by large language models.

The algorithm for building and using an n-gram model is as follows:

  1. Tokenization: Split the input text into individual words or tokens.

  2. N-gram generation: Create n-grams by forming sequences of $n$ consecutive words from the tokenized text.

  3. Frequency counting: Count the occurrences of each n-gram in the training corpus.

  4. Probability estimation: Calculate the conditional probability of each word given its previous $n-1$ words using the frequency counts.

  5. Smoothing (optional): Apply smoothing techniques to handle unseen n-grams and avoid zero probabilities.

  6. Text generation: Start with an initial seed of $n-1$ words, predict the next word based on probabilities, and iteratively generate the next words to form a sequence.

  7. Repeat generation: Continue generating words until the desired length or a stopping condition is reached.
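
Steps 3 to 5 can be sketched for the bigram case as follows. This is a minimal illustration, assuming add-one (Laplace) smoothing as the smoothing technique, which the lesson does not specify. The tiny corpus and the function name bigram_probability are hypothetical.

```python
from collections import Counter

# Tiny hypothetical corpus, already tokenized (step 1).
corpus = [
    ["the", "cat", "chased", "the", "mouse"],
    ["the", "mouse", "ate", "the", "cheese"],
]

# Steps 2-3: form bigrams and count them, along with their one-word contexts.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

vocab = {word for sentence in corpus for word in sentence}


def bigram_probability(prev, word, alpha=0.0):
    """Estimate p(word | prev) as count(prev, word) / count(prev).

    With alpha=1.0 this becomes add-one smoothing (an assumed choice):
    (count(prev, word) + 1) / (count(prev) + |V|).
    """
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = context_counts[prev] + alpha * len(vocab)
    return numerator / denominator if denominator else 0.0


print(bigram_probability("the", "mouse"))           # seen bigram, unsmoothed
print(bigram_probability("the", "ate"))             # unseen bigram, unsmoothed
print(bigram_probability("the", "ate", alpha=1.0))  # unseen bigram, smoothed
```

Without smoothing, a bigram that never occurs in the corpus gets probability zero. Smoothing moves a small amount of probability mass to such unseen bigrams so that no sequence is ruled out entirely.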

Let’s see an example in action:

import random

class NGramLanguageModel:
    def __init__(self, n):
        self.n = n  # n-gram order
        self.ngrams = {}  # maps each n-gram tuple to its frequency
        self.start_tokens = ['<start>'] * (n - 1)  # padding for sentence starts

    def train(self, corpus):
        for sentence in corpus:
            tokens = self.start_tokens + sentence.split() + ['<end>']
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                if ngram in self.ngrams:
                    self.ngrams[ngram] += 1
                else:
                    self.ngrams[ngram] = 1

    def generate_text(self, seed_text, length=10):
        seed_tokens = seed_text.split()
        padding = self.start_tokens[len(seed_tokens):]  # pad seeds shorter than n - 1 words
        generated_text = padding + seed_tokens
        current_ngram = tuple(generated_text[len(generated_text) - self.n + 1:])  # last n - 1 tokens

        for _ in range(length):
            next_words = {ng[-1]: count for ng, count in self.ngrams.items() if ng[:-1] == current_ngram}
            if next_words:
                next_word = random.choices(list(next_words), weights=list(next_words.values()))[0]  # sample by frequency
                generated_text.append(next_word)
                current_ngram = tuple(generated_text[len(generated_text) - self.n + 1:])
            else:
                break  # no n-gram in the training data continues this context

        return ' '.join(generated_text[len(padding):])  # drop the padding, keep the seed


# Toy corpus
toy_corpus = [
    "This is a simple example.",
    "The example demonstrates an N-gram language model.",
    "N-grams are used in natural language processing.",
    "This is a toy corpus for language modeling.",
]

n = 3  # Change n-gram order here

# Example usage with seed text
model = NGramLanguageModel(n)
model.train(toy_corpus)
seed_text = "This"  # Change seed text here
generated_text = model.generate_text(seed_text, length=3)
print("Seed text:", seed_text)
print("Generated text:", generated_text)


  • Line 1: We import the random module to facilitate random choices during text generation.

  • Line 3: We define a class named NGramLanguageModel to encapsulate the functionality of the n-gram language model.

  • Lines 4–7: We define the constructor for the class, which sets three attributes: the n-gram order n, an empty dictionary ngrams for storing n-gram frequencies, and the list start_tokens used for padding. The start tokens provide context at the beginning of a sentence, where there aren’t enough preceding words to form a complete n-gram.

  • Lines 9–17: We define a method named train to train the language model on a given corpus. For each sentence in the corpus, we build a token list by prepending the start tokens, splitting the sentence into individual words, and appending an end token. We then slide a window of length n over the tokens, extract each n-gram as a tuple, and update its frequency count in the ngrams dictionary.

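  • Lines 19–34: We define a method named generate_text to generate text based on the trained language model, starting with a seed text. We pad the seed with start tokens where needed, then repeatedly look up the n-grams whose first n-1 words match the current context, pick one of their final words at random, and append it to the output. Generation stops after length words or when no n-gram continues the current context.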

  • Lines 37–53: We define a toy corpus for training and testing the language model. Then, we create an instance of the NGramLanguageModel class with n-gram order n=3 and train it on the corpus. Next, we specify a seed text, generate text based on the trained model, and print both the seed text and the generated text.

Large language models

Large language models (LLMs) refer to advanced natural language processing models trained on massive amounts of textual data. These models are designed to understand and generate human-like text based on the input they receive.

Comparison with simpler LMs

LLMs and simpler LMs differ primarily in scale, complexity, and the tasks they are designed to perform. Here’s a comparison between large language models and simpler models:




| Aspect | Large Language Models | Simpler Language Models |
|---|---|---|
| Scale and Parameters | Tens to hundreds of billions of parameters | Millions of parameters |
| Training Data | Trained on vast and diverse datasets from the internet | Can be trained on smaller, domain-specific datasets |
| Versatility | Highly versatile, excelling across various NLP tasks | Task-specific, might require more fine-tuning |
| Computational Resources | Demands significant computational power and specialized hardware | More computationally efficient, accessible on standard hardware |
| Use Cases | Complex language understanding, translation, summarization, creative writing | Specific tasks like sentiment analysis and named entity recognition |

Now, let’s take a quiz to revisit the concepts taught in this lesson.


Read the question statement and then select the correct answer out of the given choices.


What is a language model?


  • A set of grammar rules and guidelines used for teaching a language

  • A probabilistic model that assigns probabilities to sequences of words or tokens in a given language

  • Software that translates text from one language to another

  • A database of definitions and synonyms for words in a specific language
