The Emergence of NLP
Learn about the evolution of natural language processing (NLP) from simple rule-based methods to advanced statistical and probabilistic models. Understand key concepts like Bag of Words, TF-IDF, and n-gram models, which enable machines to process and generate human language, forming the basis of modern generative AI systems.
What is natural language processing?
Generative AI may feel futuristic, but its creativity in writing, problem-solving, and conversation is already here. None of this would exist without natural language processing (NLP), the field that teaches machines to read, parse, and interpret human language. Before an AI can generate poetry, code, or art, it must first learn to understand language, making NLP the foundation of every breakthrough in generative technology.
Natural language processing (NLP) is the branch of artificial intelligence that enables computers to understand, interpret, and generate human language. It combines linguistics, computer science, and machine learning to enable machines to make sense of text and speech, from recognizing grammatical structure to extracting meaning and intent.
This lesson traces NLP’s journey from simple rule-based systems to methods like Bag of Words, TF-IDF, n-grams, and word embeddings. Each step forward, from rules to statistics to deep learning, was guided by one key question: “How can machines truly understand language?”
These breakthroughs paved the way for today’s large language models. By building on decades of progress, they now bring us closer to the idea of artificial general intelligence (AGI).
How did computers first interpret text?
Early NLP relied on rule-based systems, where linguists and developers wrote detailed if-then instructions for every grammar quirk. Computers scanned text, matched it against these rules, and produced outputs that worked only in narrow cases. These systems could manage small tasks, like checking subject-verb agreement, but they were rigid and brittle. A new phrase or unusual wording would often break them, highlighting the need for models that could learn from data instead of fixed rules.
For example, a system might correctly change “he is” to “he’s” but fail completely on “he really is,” producing something awkward like “he really’s.”
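To make this brittleness concrete, here is a minimal sketch (not from the original lesson) of such a contraction rule written as a regular expression; the pattern and function name are illustrative:

```python
import re

def naive_contract(text):
    # Naive rule: attach "'s" to whatever word directly precedes "is".
    # It works for "he is", but it has no notion of grammar, so any word
    # standing between the pronoun and the verb makes the rule misfire.
    return re.sub(r"(\w+) is\b", r"\1's", text)

print(naive_contract("he is"))         # he's
print(naive_contract("he really is"))  # he really's  (the rule misfires)
```

The rule matches a surface pattern, not the sentence's structure, which is exactly why rule-based systems broke on unusual wording.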
Educative byte: One of the earliest examples of rule-based NLP is ELIZA, developed in the 1960s by Joseph Weizenbaum. ELIZA could mimic a psychotherapist by following scripted patterns, showcasing the potential and limitations of rule-based systems.
Encoding every language rule by hand quickly became unmanageable. The shift to data-driven methods allowed machines to learn from word counts and patterns, paving the way for modern statistical approaches.
What is a Bag of Words (BoW)?
Bag of Words (BoW) is one of the earliest statistical NLP methods, popular since the 1960s for text classification and search. Instead of rules, it counts how often each word appears, ignoring grammar and order. For example, “cat sat on mat” is treated the same as “on mat cat sat.” Though simple and order-blind, BoW proved powerful for tasks like spam detection and topic classification.
Educative byte: BoW gained momentum in the 1990s with vector space models and early search engines, paving the way for modern text retrieval techniques.
You have two short sentences: “I love cats” and “I hate dogs.” First, you gather all the unique words from both sentences into a vocabulary: namely ["I", "love", "cats", "hate", "dogs"]. Next, you count how many times each vocabulary word appears in each sentence. The sentence “I love cats” includes “I,” “love,” and “cats” once each, so it transforms into the vector [1, 1, 1, 0, 0] (corresponding to the order in your vocabulary). Meanwhile, “I hate dogs” contains “I,” “hate,” and “dogs” once apiece, giving us [1, 0, 0, 1, 1].
In Python, for instance, you can use CountVectorizer from scikit-learn to handle tokenization and counting automatically.
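The counting described above can also be sketched in a few lines of plain Python. This is an illustrative re-implementation, not scikit-learn's own code; it builds the vocabulary in first-seen order to match the example above (note that `CountVectorizer` instead sorts its vocabulary alphabetically and lowercases tokens, so its column order differs):

```python
def bag_of_words(sentences):
    # Build the vocabulary in first-seen order, as in the example above.
    vocab = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:
                vocab.append(word)
    # Count how many times each vocabulary word appears in each sentence.
    vectors = []
    for sentence in sentences:
        words = sentence.split()
        vectors.append([words.count(term) for term in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["I love cats", "I hate dogs"])
print(vocab)    # ['I', 'love', 'cats', 'hate', 'dogs']
print(vectors)  # [[1, 1, 1, 0, 0], [1, 0, 0, 1, 1]]
```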
Bag of Words is fast and straightforward, which made it popular in early search engines and text classification. It counts how often each word appears in a text, which works well for grouping documents by topic.
However, it cannot capture meaning in tasks like sentiment analysis or translation, where word order matters. Even with this weakness, BoW represented a significant leap forward because it transformed raw language into structured features that computers could process effectively.
What is TF-IDF?
A Bag of Words counts word occurrences but treats every word the same, so common terms like “the” or “is” can overshadow more meaningful ones. This makes it hard to capture what truly defines a document.
That’s where TF-IDF (Term Frequency–Inverse Document Frequency) steps in. TF-IDF addresses this by assigning higher weights to words that are frequent in a particular document but rare across all documents. It’s like saying, “Hey, I know ‘the’ appears a thousand times, but maybe that’s just because it’s a common word. Let’s not give it too much weight.” This way, words that are unique to specific documents stand out, while common words are downplayed.
Let’s break down the concept into two parts:
Term Frequency (TF) measures how often a word appears in a single document relative to the total number of words in that document. For instance, in a product review, the word “durable” might appear frequently if many customers praise the product’s longevity. Mathematically, it’s expressed as:

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in } d}$$
Inverse Document Frequency (IDF) gauges how rare or special a word is across the entire corpus (a collection of documents or texts). Common words like “the,” “and,” or “is” appear in nearly every document, rendering them less informative. In contrast, unique words like “durable” or “innovative” may appear in fewer documents, signaling their importance. The formula for IDF is:

$$\text{IDF}(t) = \log\left(\frac{N}{\text{Number of documents containing } t}\right)$$

where $N$ is the total number of documents in the corpus.
By combining TF and IDF into TF-IDF, we create a weighted representation that emphasizes important words within specific documents while diminishing the weight of ubiquitous terms. This makes TF-IDF a more informative feature for machine learning models, enhancing tasks like classification and clustering.
For example, suppose the TF of “recipe” in a document is 0.05 (it makes up 5% of the words), and “recipe” appears in only 10 out of 1,000 documents, giving an IDF of $\log_{10}(1000/10) = 2$. Its TF-IDF score is then $0.05 \times 2 = 0.1$, a comparatively high weight (the numbers here are illustrative).
TF-IDF highlights important but unique words in a document. A word like “recipe” may have a high TF-IDF score because it’s both frequent in a specific document and not overly common across the corpus. A common word like “the” will have a low score because its IDF is low, even if its TF is high.
Educative byte: TF-IDF is based on the work of Hans Peter Luhn on term frequency and Karen Spärck Jones on inverse document frequency. Interestingly, these developments occurred two decades apart, showcasing the evolving understanding of word importance in text analysis.
Another way to think of TF-IDF is like a spotlight: TF brightens words that appear often in one document, while IDF dims those that appear everywhere. Together, they highlight the terms that truly stand out.
This made text representations more meaningful, improving tasks like classification and clustering. Tools like Python’s TfidfVectorizer turn these scores into features for machine learning, giving models a clearer signal about what matters.
Still, TF-IDF shares Bag of Words’ limitation of ignoring word order. It weighs words more intelligently, but treats them as isolated tokens. This gap led to newer methods that capture relationships between words, paving the way for embeddings and modern NLP.
Below is a Python implementation that walks through the steps of computing TF-IDF:
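The original lesson's code widget is not shown here, so the following is a minimal sketch that applies the TF and IDF formulas defined above directly (scikit-learn's `TfidfVectorizer` uses a slightly different smoothed variant, so its exact numbers will differ):

```python
import math

def compute_tf(document):
    # Term frequency: word count divided by total words in the document.
    words = document.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

def compute_idf(corpus):
    # Inverse document frequency: log of (total docs / docs containing the word).
    n_docs = len(corpus)
    all_words = {w for doc in corpus for w in doc.lower().split()}
    idf = {}
    for w in all_words:
        containing = sum(1 for doc in corpus if w in doc.lower().split())
        idf[w] = math.log10(n_docs / containing)
    return idf

def compute_tfidf(document, corpus):
    tf = compute_tf(document)
    idf = compute_idf(corpus)
    return {w: tf[w] * idf[w] for w in tf}

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the grandma shared the secret recipe",
]
scores = compute_tfidf(corpus[2], corpus)
# "the" appears in every document, so its IDF (and TF-IDF) is 0;
# "recipe" is unique to this document, so it receives a higher weight.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```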
What are n-gram models?
While TF-IDF highlights important words, it still ignores order. N-gram models addressed this by using probabilities to predict the next word based on its history.
The idea behind n-grams is straightforward: given a sequence of words, the model estimates the probability of the next word by looking at the frequency of word combinations (or “grams”) in a corpus of text. Language is sequential, so n-grams were like giving models the ability to see “neighbors” in a sentence. For example, in a bigram model (n=2), we calculate the probability of each word given the previous word:

$$P(w_n \mid w_{n-1}) = \frac{\text{Count}(w_{n-1}, w_n)}{\text{Count}(w_{n-1})}$$
What’s going on here? Don’t worry if the math feels intimidating—we’ll explain every single symbol and formula in this course step by step. In the mathematical representation above:
- $P(w_n \mid w_{n-1})$ represents the conditional probability of the current word $w_n$, given the previous word $w_{n-1}$. It’s the likelihood of $w_n$ appearing immediately after $w_{n-1}$. In simpler terms, it’s like asking, “If I’ve just seen the word $w_{n-1}$, what’s the chance that $w_n$ will come next?” Think of it as the model’s best guess for the next word based on the one that came before it.
- $\text{Count}(w_{n-1}, w_n)$ represents the number of times the word pair appears together in the corpus. Imagine you’re flipping through a novel, tallying every time the words “peanut butter” appear side by side. That’s what this count does—track how often specific pairs of words occur together.
- $\text{Count}(w_{n-1})$ represents the total number of times the word $w_{n-1}$ appears in the corpus, regardless of what word comes after it. Think of this as a popularity contest for $w_{n-1}$: how many times does it show up, no matter who it’s hanging out with?
You see, it’s like figuring out the probability that someone who orders “peanut” will also add “butter” to their plate, based on how often the two are paired in your data. Similarly, in a trigram model (n=3), the prediction depends on the two preceding words:

$$P(w_n \mid w_{n-2}, w_{n-1}) = \frac{\text{Count}(w_{n-2}, w_{n-1}, w_n)}{\text{Count}(w_{n-2}, w_{n-1})}$$
This method allows machines to generate or predict text based on observed word patterns. For example, if the corpus contains the sentence “I love pizza,” and we ask the model what word is likely to follow “I love,” it would assign a high probability to “pizza” if that pairing appeared frequently in the training data.
| 1-Gram | 2-Gram | 3-Gram |
|---|---|---|
| Generative | Generative AI | Generative AI is |
| AI | AI is | AI is fun |
| is | is fun | is fun to |
| fun | fun to | fun to learn |
| to | to learn | - |
| learn | - | - |
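The table above can be reproduced with a short helper function. Here is a minimal sketch (the function name is illustrative):

```python
def ngrams(sentence, n):
    # Slide a window of size n across the token list.
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Generative AI is fun to learn"
print(ngrams(sentence, 1))  # ['Generative', 'AI', 'is', 'fun', 'to', 'learn']
print(ngrams(sentence, 2))  # ['Generative AI', 'AI is', 'is fun', 'fun to', 'to learn']
print(ngrams(sentence, 3))  # ['Generative AI is', 'AI is fun', 'is fun to', 'fun to learn']
```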
Below is a Python example that creates and interprets a bigram probability matrix.
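The original lesson's code widget is not shown here, so the following is a minimal sketch using a small, assumed two-sentence corpus (the exact corpus from the original is not available). It counts bigrams and unigrams, then applies the bigram formula from above:

```python
from collections import defaultdict

corpus = [
    "i love natural language processing",
    "we love natural language models",
]

# Count bigram pairs and how often each word appears as a "previous" word.
bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1
        unigram_counts[prev] += 1

# P(next | prev) = Count(prev, next) / Count(prev)
probs = {
    prev: {nxt: count / unigram_counts[prev] for nxt, count in nxts.items()}
    for prev, nxts in bigram_counts.items()
}

# Print the probability matrix as a table: rows are the previous word,
# columns are the candidate next word.
vocab = sorted({w for s in corpus for w in s.split()})
print(f"{'':>12}" + "".join(f"{w:>12}" for w in vocab))
for prev in vocab:
    row = [probs.get(prev, {}).get(nxt, 0.0) for nxt in vocab]
    print(f"{prev:>12}" + "".join(f"{p:>12.2f}" for p in row))
```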
When you run the code, you’ll see a table where:

- A probability of 1 means a word is always followed by a specific next word in the dataset (e.g., if “love” is only followed by “natural”).
- A probability of 0.5 indicates multiple possible successors (e.g., if “language” precedes two different words equally often).
- A probability of 0 means a word never follows another within the dataset. Words at the end of sentences naturally have no successors, leading to zero probabilities.
The bigram matrix acts like a roadmap, showing which words are most likely to follow others. If a pair never appears in training, its probability would normally be zero. To avoid this, researchers apply smoothing methods that give unseen pairs a small, nonzero chance. A common example is Laplace (add-one) smoothing, which ensures flexibility by assigning every possible pair at least a tiny probability.
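A sketch of how add-one smoothing changes the estimate (variable and function names here are illustrative, continuing the bigram notation from above):

```python
def smoothed_bigram_prob(prev, nxt, bigram_counts, unigram_counts, vocab_size):
    # Laplace (add-one) smoothing: add 1 to every bigram count and
    # compensate by adding the vocabulary size to the denominator.
    pair_count = bigram_counts.get((prev, nxt), 0)
    return (pair_count + 1) / (unigram_counts.get(prev, 0) + vocab_size)

bigram_counts = {("peanut", "butter"): 3}
unigram_counts = {"peanut": 3}
vocab_size = 5

print(smoothed_bigram_prob("peanut", "butter", bigram_counts, unigram_counts, vocab_size))  # 0.5
print(smoothed_bigram_prob("peanut", "jelly", bigram_counts, unigram_counts, vocab_size))   # 0.125 (unseen, but nonzero)
```

The unseen pair “peanut jelly” now gets a small, nonzero probability instead of zero, which keeps the model from ruling out word sequences it simply never happened to observe.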