The Emergence of NLP
Learn about the evolution of natural language processing (NLP) from simple rule-based methods to advanced statistical and probabilistic models. Understand key concepts like Bag of Words, TF-IDF, and n-gram models, which enable machines to process and generate human language, forming the basis of modern generative AI systems.
What is natural language processing?
Generative AI may feel futuristic, but its creativity in writing, problem-solving, and conversation is already here. None of this would exist without natural language processing (NLP), the field that teaches machines to read, parse, and interpret human language. Before an AI can generate poetry, code, or art, it must first learn to understand language, making NLP the foundation of every breakthrough in generative technology.
Natural language processing (NLP) is the branch of artificial intelligence that enables computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to enable machines to make sense of text and speech, from recognizing grammatical structure to extracting meaning and intent.
This lesson traces NLP’s journey from simple rule-based systems to methods like Bag of Words, TF-IDF, n-grams, and word embeddings. Each step forward, from rules to statistics to deep learning, was guided by one key question: “How can machines truly understand language?”
These breakthroughs paved the way for today’s large language models. By building on decades of progress, they now bring us closer to the idea of artificial general intelligence (AGI).
How did computers first interpret text?
Early NLP relied on rule-based systems, where linguists and developers wrote detailed if-then instructions for every grammar quirk. Computers scanned text, matched it against these rules, and produced outputs that worked only in narrow cases. These systems could manage small tasks, like checking subject-verb agreement, but they were rigid and brittle. A new phrase or unusual wording would often break them, highlighting the need for models that could learn from data instead of fixed rules.
For example, a system might correctly change “he is” to “he’s” but fail completely on “he really is,” producing something awkward like “he really’s.”
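To make this brittleness concrete, here is a minimal sketch (not from the original lesson) of such a contraction rule written as a regular expression; the pattern and function name are illustrative:

```python
import re

def naive_contract(text):
    # Naive rule: attach "'s" to whatever word directly precedes "is".
    # It works for "he is", but it has no notion of grammar, so any word
    # standing between the pronoun and the verb makes the rule misfire.
    return re.sub(r"(\w+) is\b", r"\1's", text)

print(naive_contract("he is"))         # he's
print(naive_contract("he really is"))  # he really's  (the rule misfires)
```

The rule matches a surface pattern, not the sentence's structure, which is exactly why rule-based systems broke on unusual wording.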
Educative byte: One of the earliest examples of rule-based NLP is ELIZA, developed in the 1960s by Joseph Weizenbaum. ELIZA could mimic a psychotherapist by following scripted patterns, showcasing the potential and limitations of rule-based systems.
Encoding every language rule by hand quickly became unmanageable. The shift to data-driven methods allowed machines to learn from word counts and patterns, paving the way for modern statistical approaches.
What is a Bag of Words (BoW)?
Bag of Words (BoW) is one of the earliest statistical NLP methods, popular since the 1960s for text classification and search. Instead of rules, it counts how often each word appears, ignoring grammar and order. For example, “cat sat on mat” is treated the same as “on mat cat sat.” Though simple and order-blind, BoW proved powerful for tasks like spam detection and topic classification.
Educative byte: BoW gained momentum in the 1990s with vector space models and early search engines, paving the way for modern text retrieval techniques.
You have two short sentences: “I love cats” and “I hate dogs.” First, you gather all the unique words from both sentences into a vocabulary: namely ["I", "love", "cats", "hate", "dogs"]. Next, you count how many times each vocabulary word appears in each sentence. The sentence “I love cats” includes “I,” “love,” and “cats” once each, so it transforms into the vector [1, 1, 1, 0, 0] (corresponding to the order in your vocabulary). Meanwhile, “I hate dogs” contains “I,” “hate,” and “dogs” once apiece, giving us [1, 0, 0, 1, 1].
In Python, for instance, you can use CountVectorizer from scikit-learn to handle tokenization and counting automatically.
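The counting described above can also be sketched in a few lines of plain Python. This is an illustrative re-implementation, not scikit-learn's own code; it builds the vocabulary in first-seen order to match the example above (note that `CountVectorizer` instead sorts its vocabulary alphabetically and lowercases tokens, so its column order differs):

```python
def bag_of_words(sentences):
    # Build the vocabulary in first-seen order, as in the example above.
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    # Count how many times each vocabulary word appears in each sentence.
    vectors = []
    for sentence in sentences:
        words = sentence.split()
        vectors.append([words.count(term) for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["I love cats", "I hate dogs"])
print(vocab)    # ['I', 'love', 'cats', 'hate', 'dogs']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```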
Bag of Words is fast and straightforward, which made it popular in early search engines and text classification. It counts how often each word appears in a text, which works well for grouping documents by topic.
However, it cannot capture meaning in tasks like sentiment analysis or translation, where word order matters. Even with this weakness, BoW represented a significant leap forward because it transformed raw language into structured features that computers could process effectively.
What is TF-IDF?
A Bag of Words counts word occurrences but treats every word the same, so common terms like “the” or “is” can overshadow more meaningful ones. This makes it hard to capture what truly defines a document.
That’s where TF-IDF (Term Frequency–Inverse Document Frequency) steps in. TF-IDF addresses this by assigning higher weights to words that are frequent in a particular document but rare across all documents. It’s like saying, “Hey, I know ‘the’ appears a thousand times, but maybe that’s just because it’s a common word. Let’s not give it too much weight.” This way, words that are unique to specific documents stand out, while common words are downplayed.
Let’s break down the concept into two parts:
Term Frequency (TF) measures how often a word appears in a single document relative to the total number of words in that document. For instance, in a product review, the word “durable” might appear frequently if many customers praise the product’s longevity. Mathematically, it’s expressed as:

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in } d}$$
Inverse Document Frequency (IDF) gauges how rare or special a word is across the entire corpus (a collection of documents or texts). Common words like “the,” “and,” or “is” appear in nearly every document, rendering them less informative. In contrast, unique words like “durable” or “innovative” may appear in fewer documents, signaling their importance. The formula for IDF is:

$$\text{IDF}(t) = \log\left(\frac{N}{\text{Number of documents containing } t}\right)$$

where $N$ is the total number of documents in the corpus.
By combining TF and IDF into TF-IDF, we create a weighted representation that emphasizes important words within specific documents while diminishing the weight of ubiquitous terms. This makes TF-IDF a more informative feature for machine learning models, enhancing tasks like classification and clustering.
For example, suppose the TF of “recipe” in a document is 0.05 (it makes up 5% of the words), and “recipe” appears in only 10 out of 1,000 documents, giving an IDF of $\log_{10}(1000/10) = 2$. Its TF-IDF score is then $0.05 \times 2 = 0.1$, a comparatively high weight (the numbers here are illustrative).
TF-IDF highlights important but unique words in a document. A word like “recipe” may have a high TF-IDF score because it’s both frequent in a specific document and not overly common across the corpus. A common word like “the” will have a low score because its IDF is low, even if its TF is high.
Educative byte: TF-IDF is based on the work of Hans Peter Luhn on term frequency and Karen Spärck Jones on inverse document frequency. Interestingly, these developments occurred two decades apart, showcasing the evolving understanding of word importance in text analysis.
Another way to think of TF-IDF is like a spotlight: TF brightens words that appear often in one document, while IDF dims those that appear everywhere. Together, they highlight the terms that truly stand out.
This made text representations more meaningful, improving tasks like classification and clustering. Tools like Python’s TfidfVectorizer turn these scores into features for machine learning, giving models a clearer signal about what matters.
Still, TF-IDF shares Bag of Words’ limitation of ignoring word order. It weighs words more intelligently, but treats them as isolated tokens. This gap led to newer methods that capture relationships between words, paving the way for embeddings and modern NLP.
Below is a Python implementation that walks through the steps of computing TF-IDF:
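The original lesson's code widget is not shown here, so the following is a minimal sketch that applies the TF and IDF formulas defined above directly (scikit-learn's `TfidfVectorizer` uses a slightly different smoothed variant, so its exact numbers will differ):

```python
import math

def compute_tf(document):
    # Term frequency: word count divided by total words in the document.
    words = document.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def compute_idf(corpus):
    # Inverse document frequency: log of (total docs / docs containing the word).
    n_docs = len(corpus)
    all_words = {w for doc in corpus for w in doc.lower().split()}
    idf = {}
    for w in all_words:
        containing = sum(1 for doc in corpus if w in doc.lower().split())
        idf[w] = math.log10(n_docs / containing)
    return idf

def compute_tfidf(document, corpus):
    tf = compute_tf(document)
    idf = compute_idf(corpus)
    return {w: tf[w] * idf[w] for w in tf}

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the grandma shared the secret recipe",
]
scores = compute_tfidf(corpus[2], corpus)
# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "recipe" is unique to this document, so it receives a higher weight.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```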
What are n-gram models?
While TF-IDF highlights important words, it still ignores order. N-gram models addressed this by using probabilities to predict the next word based on its history.
The idea behind n-grams is straightforward: given a sequence of words, the model estimates the probability of the next word by looking at the frequency of word combinations (or “grams”) in a corpus of text. Language is sequential, so n-grams were like giving models the ability to see “neighbors” in a sentence. For example, in a bigram model (n=2), we calculate the probability of each word given the previous word:

$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}$$
What’s going on here? Don’t worry if the math feels intimidating—we’ll explain every single symbol and formula in this course step by step. In the mathematical representation above:
- $P(w_n \mid w_{n-1})$ represents the conditional probability of the current word $w_n$, given the previous word $w_{n-1}$. It’s the likelihood of $w_n$ appearing immediately after $w_{n-1}$. In simpler terms, it’s like asking, “If I’ve just seen the word $w_{n-1}$, what’s the chance that $w_n$ will come next?” Think of it as the model’s best guess for the next word based on the one that came before it.
- $\text{Count}(w_{n-1}, w_n)$ represents the number of times the word pair appears together in the corpus. Imagine you’re flipping through a novel, tallying every time the words “peanut butter” appear side by side. That’s what this count does—track how often specific pairs of words occur together.
- $\text{Count}(w_{n-1})$ represents the total number of times the word $w_{n-1}$ appears in the corpus, regardless of what word comes after it. Think of this as a popularity contest for $w_{n-1}$: how many times does it show up, no matter who it’s hanging out with?
You see, it’s like figuring out the probability that someone who orders “peanut” will also add “butter” to their plate, based on how often the two are paired in your data. Similarly, in a trigram model (n=3), the prediction depends on the two preceding words:

$$P(w_n \mid w_{n-2}, w_{n-1}) = \frac{\text{Count}(w_{n-2}, w_{n-1}, w_n)}{\text{Count}(w_{n-2}, w_{n-1})}$$
This method allows machines to generate or predict text based on observed word patterns. For example, if the corpus contains the sentence “I love pizza,” and we ask the model what word is likely to follow “I love,” it would assign a high probability to “pizza” if that pairing appeared frequently in the training data.
| 1-Gram | 2-Gram | 3-Gram |
|---|---|---|
| Generative | Generative AI | Generative AI is |
| AI | AI is | AI is fun |
| is | is fun | is fun to |
| fun | fun to | fun to learn |
| to | to learn | - |
| learn | - | - |
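The table above can be reproduced with a short helper function. Here is a minimal sketch (the function name is illustrative):

```python
def ngrams(sentence, n):
    # Slide a window of size n across the token list.
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Generative AI is fun to learn"
print(ngrams(sentence, 1))  # ['Generative', 'AI', 'is', 'fun', 'to', 'learn']
print(ngrams(sentence, 2))  # ['Generative AI', 'AI is', 'is fun', 'fun to', 'to learn']
print(ngrams(sentence, 3))  # ['Generative AI is', 'AI is fun', 'is fun to', 'fun to learn']
```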
Below is a Python example that creates and interprets a bigram probability matrix.
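The original lesson's code widget is not shown here, so the following is a minimal sketch using a small, assumed two-sentence corpus (the exact corpus from the original is not available). It counts bigrams and unigrams, then applies the bigram formula from above:

```python
from collections import defaultdict

corpus = [
    "i love natural language processing",
    "we love natural language models",
]

# Count bigram pairs and how often each word appears as a "previous" word.
bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1
        unigram_counts[prev] += 1

# P(next | prev) = Count(prev, next) / Count(prev)
probs = {
    prev: {nxt: count / unigram_counts[prev] for nxt, count in nxts.items()}
    for prev, nxts in bigram_counts.items()
}

# Print the probability matrix as a table: rows are the previous word,
# columns are the candidate next word.
vocab = sorted({w for s in corpus for w in s.split()})
print(f"{'':>12}" + "".join(f"{w:>12}" for w in vocab))
for prev in vocab:
    row = [probs.get(prev, {}).get(nxt, 0.0) for nxt in vocab]
    print(f"{prev:>12}" + "".join(f"{p:>12.2f}" for p in row))
```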
When you run the code, you’ll see a table where:

- A probability of 1 means a word is always followed by a specific next word in the dataset (e.g., if “love” is only followed by “natural”).
- A probability of 0.5 indicates multiple possible successors (e.g., if “language” precedes two different words equally often).
- A probability of 0 means a word never follows another within the dataset. Words at the end of sentences naturally have no successors, leading to zero probabilities.
The bigram matrix acts like a roadmap, showing which words are most likely to follow others. If a pair never appears in training, its probability would normally be zero. To avoid this, researchers apply smoothing methods that give unseen pairs a small, nonzero chance. A common example is Laplace (add-one) smoothing, which ensures flexibility by assigning every possible pair at least a tiny probability.
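A sketch of how add-one smoothing changes the estimate (variable and function names here are illustrative, continuing the bigram notation from above):

```python
def smoothed_bigram_prob(prev, nxt, bigram_counts, unigram_counts, vocab_size):
    # Laplace (add-one) smoothing: add 1 to every bigram count and
    # compensate by adding the vocabulary size to the denominator.
    pair_count = bigram_counts.get((prev, nxt), 0)
    return (pair_count + 1) / (unigram_counts.get(prev, 0) + vocab_size)

bigram_counts = {("peanut", "butter"): 3}
unigram_counts = {"peanut": 3}
vocab_size = 5

print(smoothed_bigram_prob("peanut", "butter", bigram_counts, unigram_counts, vocab_size))  # 0.5
print(smoothed_bigram_prob("peanut", "jelly", bigram_counts, unigram_counts, vocab_size))   # 0.125 (unseen, but nonzero)
```

The unseen pair “peanut jelly” now gets a small, nonzero probability instead of zero, which keeps the model from ruling out word sequences it simply never happened to observe.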