Lemmatization with spaCy
In Natural Language Processing (NLP), dealing with text data efficiently is essential for tasks like text classification, sentiment analysis, and information retrieval. One of the fundamental preprocessing steps in NLP is lemmatization. Lemmatization reduces different word forms to a common base form, simplifying text analysis. In this Answer, we’ll explore the concept of lemmatization, its importance, and how to perform it using the popular NLP library, spaCy.
What is lemmatization?
A lemma is the base form of a token. For example, the lemma of the word eating is eat. The lemma represents the canonical or dictionary form of a word, which helps in grouping together words with similar meanings. It is commonly used in NLP tasks such as text classification, information retrieval, language modeling, etc.
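To make the idea concrete, here is a toy sketch that maps a few inflected forms to one dictionary form with a plain lookup table. This is only an illustration; real lemmatizers rely on vocabulary rules and part-of-speech tags, not a fixed dictionary:

# Toy illustration: lemmatization maps inflected forms to one base form.
# Real lemmatizers use vocabulary rules and POS tags, not a fixed dict.
lemmas = {"eating": "eat", "ate": "eat", "eaten": "eat", "eats": "eat"}

word = "eating"
print(lemmas.get(word, word))  # prints "eat"; unknown words fall back to themselves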
Lemmatization using spaCy
spaCy is a popular NLP library in Python and provides elegant solutions for various NLP and ML-related tasks, including lemmatization. For this task, we can use spaCy’s built-in lemmatizer. Let’s see how we can achieve this:
import spacy
nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
doc = nlp("I have been working at this place for many years")
for token in doc:
    print(token.text, token.lemma_)
Let’s go over the code above.
Lines 1–2: We import the spacy library and load the en_core_web_md language model. The en_core_web_md model includes a comprehensive vocabulary, word vectors, POS tags, and syntactic dependencies based on web texts, making it suitable for general-purpose text processing. The disable=["ner", "parser"] parameter disables the named entity recognition (NER) and syntactic parser components of the spaCy pipeline since we are not using them, which keeps memory usage down (see the sketch after this list).
Line 3: We create a doc object by passing our example sentence to the model.
Lines 4–5: We declare a loop that iterates through all the tokens in the doc and prints each token’s text and lemma.
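As a quick sanity check (not part of the original example), we can print nlp.pipe_names to confirm that the NER and parser components were actually removed from the pipeline:

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
# The disabled components should be missing from this list; for a
# typical spaCy v3 model, something like
# ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'] remains.
print(nlp.pipe_names)

Note that the tagger stays enabled on purpose: spaCy’s rule-based lemmatizer uses part-of-speech tags to pick the right lemma.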
Batch processing with spaCy
Let’s look at another example where we’ll process a large text in smaller batches to avoid memory overload. We’ll use the en_core_web_md model and process text in chunks.
import spacy

# Function to process text in batches
def process_text_in_batches(text, batch_size=10):
    nlp = spacy.load("en_core_web_md")
    for i in range(0, len(text), batch_size):
        batch = text[i:i + batch_size]
        doc = nlp(batch)
        for token in doc:
            print(token.text, token.lemma_)

# Sample large text
text = ("I have been working at this place for many years. " * 1000)

# Process the text in batches of 100 characters
process_text_in_batches(text, batch_size=100)
Here’s the line-by-line explanation of the code example above:
Line 1: We import the spacy library.
Line 4: We define a function named process_text_in_batches.
Line 5: We load the en_core_web_md language model.
Line 6: We declare a loop that iterates through the text in steps of batch_size.
Line 7: We create a batch of the text with the specified batch size.
Line 8: We process the batch with the spaCy model to create a doc object.
Line 9: We declare a loop that iterates through all the tokens in the doc.
Line 10: We print the text and the lemma of each token.
Line 13: We create a sample large text by repeating a sentence 1,000 times.
Line 16: We call the process_text_in_batches function with the sample text and a batch size of 100 characters.
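One caveat: slicing by characters can cut a word in half at a batch boundary, which would distort its tokens and lemmas. As an alternative sketch, spaCy’s built-in nlp.pipe method accepts an iterable of texts and batches them internally, so we can feed it whole sentences instead of raw character slices:

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])

# Whole sentences instead of character slices, so no word is split
# across a batch boundary.
texts = ["I have been working at this place for many years."] * 1000

# nlp.pipe streams the texts and processes them in internal batches;
# batch_size controls how many texts are buffered at a time.
for doc in nlp.pipe(texts, batch_size=100):
    for token in doc:
        print(token.text, token.lemma_)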