
Text Preprocessing Essentials

Explore key text preprocessing methods including tokenization, stemming, and lemmatization that transform messy text into clean input for generative AI models. Understand how these techniques improve language understanding and model accuracy, and learn when to apply different normalization approaches to optimize AI performance.

Imagine you’re building a spam filter and a message arrives in ALL CAPS with random punctuation and emojis. As a human, you recognize the tone instantly, but to a computer, it is just a jumble of characters. Without cleanup, the system cannot make sense of the message. Text preprocessing organizes this mess so models can understand and work effectively.

The same idea applies to generative AI models that write poetry, generate code, or engage in smooth conversation. Behind every response is careful processing of raw text into a usable form. Whether building a chatbot or coding assistant, models are only as strong as their data. Text preprocessing ensures clean input, which leads to reliable, high-quality results.

The messy world of text

Suppose you’re handed a giant folder of user-generated content: product reviews, live stream chat logs, or even casual Slack conversations within a company. You open one document and see something like the following:

OMG!!!!! BEST. PRODUCT. EVER?!?!?? #mustbuy #CrazyDeal <3 <3

Meh… Not sure if it’s worth it??? $$ $$ 

holy guacamole I luv this soooo much 10/10 WOULD RECCOMEND

pls gimme code for sortin arr?? #help IM DESPERATE .!!. meltdown ??? TL;DR help

As a human reader, you can probably get the gist of each statement. But imagine for a moment you’re a computer program trying to read this data: without any cleanup, all those extra symbols and stylistic flourishes quickly become noise, confusing any machine learning or AI algorithm. So, how do we tame messy text in a way machines can understand?

We’ll explore three key techniques that turn messy text into usable input: tokenization, stemming, and lemmatization. These steps emerged from real-world needs, such as improving search engines, and are now essential for modern foundation models. By learning their purpose, you’ll see how simple preprocessing tasks support today’s most advanced AI systems.

Why does text preprocessing matter?

In the 1960s and 1970s, search systems struggled with messy text full of inconsistent spacing, punctuation, and word variations. Researchers learned that splitting text, removing noise, and normalizing words were essential for keyword matching.

Those same ideas evolved into core NLP techniques that power tasks like classification and sentiment analysis, paving the way for generative AI. Even advanced models such as ChatGPT depend on these basics: text must be tokenized and normalized before producing a coherent reply.

Think of it like cleaning a camera lens. No matter how advanced the camera or the AI model, results suffer if the lens is cluttered. Preprocessing keeps the view clear, from simple search queries to cutting-edge AI conversations.

Educative byte: While early systems required extensive preprocessing, many modern transformer-based models are designed to handle relatively raw text. However, effective tokenization remains a critical first step, even for these models, to ensure consistency and manage vocabulary efficiently.

What is tokenization?

Tokenization is the process of splitting raw text into smaller units called tokens. These might be words, subwords, or individual characters. Imagine you’re handed a dense, unformatted sentence without spaces or punctuation: "GenerativeAIisfascinatingandisthefuture". Without tokenization, deciphering meaningful segments becomes nearly impossible. A human can intuitively read this as “Generative AI is fascinating and is the future,” but machines require explicit instructions to recognize word boundaries. Tokenization bridges this gap, enabling machines to identify and separate individual words or meaningful subunits within the text.

Tokenization

In reality, tokenization results may vary depending on the tokenizer used. The above image is for demonstration purposes only. Moreover, languages vary widely in their structure and word boundaries. For instance:

  • English: Spaces separate words, making basic tokenization relatively straightforward.

  • Chinese/Japanese: Words often aren’t separated by spaces, requiring statistical models or dictionary-based approaches to segment text.

  • Social media and code: Hashtags (#MachineLearning), contractions (can’t → can + not), and code snippets (int myVar = 5;) all require specialized tokenization strategies.

Effective tokenization must account for these linguistic nuances to accurately parse and process text across different languages, ensuring that NLP models remain versatile and applicable in diverse linguistic contexts.
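To make this concrete, here is a minimal sketch of a regex-based tokenizer for social-media-style text. It is an illustration only (the pattern and example string are ours, not a production rule set), but it shows how hashtags and simple contractions can be kept intact as single tokens:

Python 3.10.4
import re

# A toy pattern (illustrative only): keep hashtags and simple contractions
# together, and treat any other non-space symbol as its own token.
pattern = r"#\w+|\w+(?:'\w+)?|[^\w\s]"

text = "Can't wait to try #MachineLearning!!!"
print(re.findall(pattern, text))
# ["Can't", 'wait', 'to', 'try', '#MachineLearning', '!', '!', '!']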

Types of tokenization

Tokenization can be approached in various ways, each suited to different applications and linguistic complexities:

  • Word tokenization splits text into individual words. For example, the sentence “Generative AI is fascinating.” becomes ["Generative", "AI", "is", "fascinating", "."]. This method is straightforward but may struggle with compound words, contractions (e.g., "don't" might be split into ["don", "'t"] or handled specially), or hyphenated words. Advanced tokenizers use regular expressions or statistical models to manage these cases.

  • Subword tokenization breaks words into smaller units, which is particularly useful for handling unknown or rare words. For instance, “unhappiness” might be tokenized into ["un", "happiness"]. Take the word “tokenization.” A subword tokenizer might split it into ["token", "ization"], letting the model reuse “token” in “tokenizer” or “tokenized.” This is how models like GPT-4.5, Claude 3.7, and Grok 3 handle obscure terms like “supercalifragilisticexpialidocious” without breaking a sweat.

Note: Modern large language models often employ subword tokenization methods like byte pair encoding (BPE) or SentencePiece to handle massive vocabularies efficiently. Tools like tiktoken are designed for GPT-based models to keep track of token counts and ensure prompts fit within token limits. Context window sizes are evolving rapidly! We will take a closer look at these advanced tokenizers later in the course.

  • Character tokenization splits text into individual characters, such as
    ["G", "e", "n", "e", "r", "a", "t", "i", "v", "e", ...]. While this method captures every detail, it often results in longer sequences that can be computationally intensive for models to process.

By breaking text into tokens, models can more effectively process and generate language, producing responses that are more human-like. Effective tokenization enables generative AI to handle a wide range of content, from simple sentences to complex technical jargon, while maintaining accuracy and fluency.

Let’s implement a basic word tokenizer using Python. This example will split a sentence into words and punctuation marks, demonstrating how tokenization structures raw text.

Python 3.10.4
def simple_tokenize(text):
    tokens = []
    current_word = ""
    for char in text:
        if char.isalnum():
            current_word += char
        else:
            if current_word != "":
                tokens.append(current_word)  # Append the accumulated word.
                current_word = ""
            if char.strip() != "":  # Ignore whitespace.
                tokens.append(char)  # Append punctuation or other non-alphanumeric characters.
    if current_word != "":
        tokens.append(current_word)  # Append any remaining word.
    return tokens

# Example usage
sentence = "Generative AI is fascinating!"
tokens = simple_tokenize(sentence)
print(tokens)

This simple function iterates through each character in the input text, building words by collecting alphanumeric characters and separating out punctuation as individual tokens. While rudimentary, this approach highlights the fundamental process of tokenization, providing a clear starting point for more advanced techniques.

From tokenization to normalization

While the provided examples illustrate basic tokenization, real-world applications often utilize advanced libraries, such as NLTK, spaCy, or Hugging Face’s Tokenizers, for more efficient and sophisticated tokenization processes. These libraries handle a variety of languages and complex tokenization rules, making them indispensable for large-scale NLP projects.
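For a taste of what such libraries produce, here is a minimal sketch using Hugging Face’s transformers library (assuming it is installed; the GPT-2 tokenizer is chosen purely as one example of a subword tokenizer):

Python 3.10.4
# Requires: pip install transformers
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer as an example of a subword (BPE) tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization of unhappiness isn't straightforward!"
print("Subword pieces:", tokenizer.tokenize(text))
print("Token IDs:", tokenizer.encode(text))

Notice how rare or long words come back as smaller, reusable pieces—exactly the subword behavior described above.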

Educative byte: GPT‑4.5 can handle around 128,000 input tokens at a time. Ever wonder how it measures input size? You guessed it—by counting tokens! Rather than tracking characters or raw words, the model breaks your text into smaller blocks that it can process, ensuring it stays within the massive but finite 128k-token envelope.

Claude 3.7, on the other hand, can maintain up to 200,000 tokens, while Grok 3 pushes the limit even further with around 1 million tokens, give or take! But remember—bigger context windows don’t always mean better models. While large token limits help with long documents and maintaining memory over extended conversations, different models excel in different areas.
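As a rough illustration of how input size is measured against a context window, here is a minimal sketch using OpenAI’s tiktoken library (assuming it is installed; the cl100k_base encoding is just one example and does not necessarily match every model mentioned above):

Python 3.10.4
# Requires: pip install tiktoken
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # an example BPE encoding

prompt = "Generative AI is fascinating and is the future."
token_ids = encoding.encode(prompt)

print("Token count:", len(token_ids))
print("Pieces:", [encoding.decode([t]) for t in token_ids])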

While tokenization breaks text into manageable units, it does not solve the variability of human language. Words often appear in different forms, such as plurals, verb tenses, or comparatives. For example, "run," "running," and "ran" all express the same idea but would be treated as separate tokens. This fragmentation creates inefficiencies by forcing models to learn redundant patterns.

Types of normalization

Researchers recognized this problem early on, realizing that a consistent way to treat word variants was needed. This led to two main approaches for word normalization: stemming and lemmatization.

What is stemming?

Stemming is a rule-based process that truncates words by removing common prefixes or suffixes. It’s quick and computationally simple, making it popular for tasks like document classification and search engine indexing. By collapsing words like “cats” and “cat” into the common stem “cat,” or “running” and “runs” into “run,” stemming consolidates morphological variants so models can learn a single representation. This drastically reduces the vocabulary size in classical NLP pipelines, which can improve speed and accuracy.

Established algorithms such as the Porter stemmer (a rule-based algorithm that removes common morphological endings from words to reduce them to their root form) or the Snowball stemmer (an improved, more flexible version of the Porter stemmer that supports multiple languages) have been widely used in NLP for decades. They apply more refined rule sets than our simple example below but still operate on similar principles. Their internal details, however, aren’t essential to understanding how generative AI works, so we won’t dig into them further here.

Stemming

We’ll create a basic stemmer that removes common suffixes. This example demonstrates how stemming reduces words to their root forms, albeit in a simplistic manner.

Python 3.10.4
def simple_stem(word):
    suffixes = ["ing", "ly", "ed", "ious", "ies", "ive", "es", "s", "ment"]
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]  # Remove the matched suffix.
    return word

# Example usage
words = ["running", "happily", "tried", "faster", "cats"]
stemmed_words = [simple_stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

This simple stemmer removes suffixes but doesn’t account for all linguistic nuances. For instance, “faster” remains “faster” because it doesn’t match any suffix in the list, while “happily” becomes “happi,” “running” becomes “runn,” and “tried” becomes “tri,” reflecting the crude but efficient nature of stemming. This highlights the limitations of basic stemming approaches, emphasizing the need for more sophisticated methods in real-world applications.
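For comparison, the established stemmers mentioned earlier are available in NLTK. Here is a minimal sketch (assuming NLTK is installed) that runs the same word list through the Porter and Snowball stemmers:

Python 3.10.4
# Requires: pip install nltk
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["running", "happily", "tried", "faster", "cats"]
print("Porter:  ", [porter.stem(w) for w in words])
print("Snowball:", [snowball.stem(w) for w in words])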

What is lemmatization?

Lemmatization is a more sophisticated approach to mapping words to their base or dictionary form (a lemma). Unlike stemming, lemmatization typically requires knowledge of a word’s part of speech and may rely on morphological analyzers or lexical databases. It has deep origins in computational linguistics and classical philology, where scholars created tools to handle inflected forms of Latin, Greek, and other languages. As NLP matured, this linguistic expertise was integrated into text-processing pipelines, offering more precise normalization than stemming could provide.

Lemmatization

Whereas a stemmer might turn “better” into “bett,” a good lemmatizer recognizes that “better” can be mapped to “good.” Similarly, “running” may become “run,” and “ran” may also become “run.” This yields more linguistically accurate groupings of word variants—crucial in tasks such as sentiment analysis, where subtle changes in meaning are significant.

We’ll also create a very basic lemmatizer using a predefined dictionary for irregular forms. This approach demonstrates how lemmatization can accurately reduce words to their lemmas based on known irregularities.

Python 3.10.4
def simple_lemmatize(word):
    # A minimal dictionary for known irregular forms.
    irregular_lemmas = {
        "running": "run",
        "happily": "happy",
        "ran": "run",
        "better": "good",
        "faster": "fast",
        "cats": "cat",
        "dogs": "dog",
        "are": "be",
        "is": "be",
        "have": "have"
    }
    return irregular_lemmas.get(word, word)

# Example usage
words = ["running", "happily", "ran", "better", "faster", "cats"]
lemmatized_words = [simple_lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)

This simple lemmatizer only handles a few irregular forms and doesn’t cover the full complexity of English morphology. It illustrates the concept of lemmatization by accurately reducing known irregular words, highlighting the difference between stemming and lemmatization in handling linguistic nuances.
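To see a practical lemmatizer in action, here is a minimal sketch using NLTK’s WordNet lemmatizer (assuming NLTK and its WordNet data are available). Note that supplying the part of speech matters:

Python 3.10.4
# Requires: pip install nltk, plus the WordNet data downloaded below.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer looks up

lemmatizer = WordNetLemmatizer()

# The pos argument tells the lemmatizer how to interpret the word:
# "v" = verb, "a" = adjective, "n" = noun.
print(lemmatizer.lemmatize("running", pos="v"))  # expected: "run"
print(lemmatizer.lemmatize("better", pos="a"))   # expected: "good"
print(lemmatizer.lemmatize("cats", pos="n"))     # expected: "cat"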

Which is better: stemming or lemmatization?

Both methods reduce words like "run," "runs," and "ran" to a base form, but they differ in approach.

Stemming uses simple rules to chop off suffixes, which is fast and works well for tasks like search indexing, even if results are imperfect.

Lemmatization relies on dictionaries and context to return accurate base forms, making it slower but more precise. It is valuable for tasks such as sentiment analysis and translation, where meaning is crucial.

In practice, choose stemming when speed is the priority and lemmatization when accuracy is essential.

For example, search engines like Elasticsearch often use stemming for quick indexing, while sentiment analysis benefits more from lemmatization to capture meaning accurately.

Combining stemming and lemmatization

Although stemming and lemmatization are usually used separately, they can occasionally be combined. A model might stem words first for speed, then apply lemmatization for accuracy. This adds extra complexity, so in practice it is rare. Most tasks are handled efficiently by choosing one method based on requirements.

Here’s a fun linguistic puzzle that can trip up tokenization, stemming, and lemmatization:

“The fisherman painted a bass on the wall.”
“The fisherman listened to the deep bass of the waves.”

The word “bass” has two very different meanings: one refers to a type of fish, while the other refers to a low-pitched sound. This shows how models struggle with context if preprocessing does not handle word sense disambiguation. A modern solution is contextual embeddings, which interpret a word’s meaning based on surrounding text. For example, they help the model tell whether “bass” refers to a fish or to sound. (We will explore this in detail later in the course.)

Stemming is often viewed as crude, while lemmatization is perceived as precise; however, both have their strengths and limitations. Stemming’s simplicity works well for tasks like search indexing, even with imperfect results. Lemmatization offers greater accuracy but depends on good linguistic resources. The choice depends on your application’s needs and the trade-offs involved.

What are some other preprocessing techniques?

In addition to tokenization, stemming, and lemmatization, several other preprocessing techniques enhance text data quality for NLP models:

  • Lowercasing: Standardizes text to reduce vocabulary size and improve efficiency.

Educative byte: There are times when “clean” text can also backfire. For example, lowercasing seems harmless, right? Not always. Take “Apple released the iPhone” vs. “I ate an apple.” Lowercasing both to “apple” erases the difference between a fruit and a trillion-dollar company. Similarly, stripping stop words like “not” from “not bad” flips the sentiment from neutral to positive!

  • Remove stop words: Drop common words like “the” and “is” to focus on meaning.

  • Strip punctuation: Removes symbols that add noise in many tasks.

  • Handle special characters and numbers: Keep or drop them based on the task.

  • Expand contractions: Turn “don’t” into “do not” for clearer meaning.

  • Fix common misspellings: Normalize typos to consistent forms.

  • Standardize abbreviations and acronyms: Expand “AI” to “Artificial Intelligence” when clarity matters.

Below is an example of simple Python code (using only built-in functions) that demonstrates each of these preprocessing steps in turn:

Python 3.10.4
# Sample text containing various cases
text = "Apple released the iPhone! I didn't know that Apple's announcement would shock everyone. Don't you think it's amazing?"
print("Original Text:")
print(text)
print("-" * 100)

# 1. Lowercasing: Convert all text to lowercase
lower_text = text.lower()
print("After Lowercasing:")
print(lower_text)
print("-" * 100)

# 2. Tokenization: Split text into words (this simple approach splits on whitespace)
tokens = lower_text.split()
print("After Tokenization:")
print(tokens)
print("-" * 100)

# 3. Stripping Punctuation: Remove punctuation from each token
# Define a set of punctuation characters
punctuations = '.,!?\'":;()'
tokens = [token.strip(punctuations) for token in tokens]
print("After Removing Punctuation:")
print(tokens)
print("-" * 100)

# 4. Removing Stop Words: Filter out common, semantically insignificant words
stop_words = ['the', 'is', 'at', 'on', 'and', 'a', 'an', 'of', 'that', 'would', 'you', 'it']
tokens = [token for token in tokens if token not in stop_words]
print("After Removing Stop Words:")
print(tokens)
print("-" * 100)

# 5. Expanding Contractions: Replace contractions with their expanded forms
# Note: This is a simple dictionary for demonstration
contractions = {
    "didn't": "did not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
    "i've": "i have",
    "apple's": "apple has"
}
expanded_tokens = []
for token in tokens:
    if token in contractions:
        # Split the expanded form to keep tokens consistent
        expanded_tokens.extend(contractions[token].split())
    else:
        expanded_tokens.append(token)
tokens = expanded_tokens
print("After Expanding Contractions:")
print(tokens)
print("-" * 100)

# 6. Handling Special Characters and Numbers:
# For this example, remove tokens that are purely numeric.
tokens = [token for token in tokens if not token.isdigit()]
print("After Handling Numbers:")
print(tokens)
print("-" * 100)

# 7. Correcting Misspellings:
# A very basic approach using a predefined dictionary of common corrections.
corrections = {
    "iphon": "iphone",  # Example: if a typo occurred
    # add more common misspellings as needed
}
tokens = [corrections.get(token, token) for token in tokens]
print("After Correcting Misspellings:")
print(tokens)
print("-" * 100)

# 8. Dealing with Abbreviations and Acronyms:
# Expand or standardize abbreviations using a simple mapping.
abbreviations = {
    "ai": "artificial intelligence",
    # add additional abbreviation mappings as needed
}
tokens = [abbreviations.get(token, token) for token in tokens]
print("After Expanding Abbreviations:")
print(tokens)
print("-" * 100)

# Final preprocessed tokens
print("Final Preprocessed Tokens:")
print(tokens)

These steps refine input data, reducing noise and inconsistencies and improving generative AI models’ accuracy and coherence.

Tackling embedded biases in real-world data

Real-world data often carries hidden biases. Image models may default to right-handed people, and text models trained mostly on English may miss dialects or underrepresented languages. Researchers reduce bias by balancing datasets, filtering harmful examples, and using data augmentation, but no dataset can ever be fully fair.

Groups like CommonCrawl gather diverse text to support more equitable AI, yet ongoing monitoring is always needed. Bias can appear at every stage, so recognizing and minimizing it is key. Preprocessing also sets the stage for advanced NLP methods like bag of words, TF-IDF, and word embeddings, which lead to models such as RNNs, Transformers, and GPT.

In upcoming lessons, we will explore these classical techniques and see how they evolve into the architectures that power modern AI systems such as GPT.