Lemmatization with spaCy
In Natural Language Processing (NLP), dealing with text data efficiently is essential for tasks like text classification, sentiment analysis, and information retrieval. One of the fundamental preprocessing steps in NLP is lemmatization. Lemmatization reduces different word forms to a common base form, simplifying text analysis. In this Answer, we’ll explore the concept of lemmatization, its importance, and how to perform it using the popular NLP library, spaCy.
What is lemmatization?
A lemma is the base form of a token. For example, the lemma of the word eating is eat. The lemma represents the canonical or dictionary form of a word, which helps in grouping together words with similar meanings. It is commonly used in NLP tasks such as text classification, information retrieval, language modeling, etc.
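To make the idea concrete, here is a toy sketch that maps a few inflected forms to one dictionary form with a plain lookup table. This is only an illustration; real lemmatizers rely on vocabulary rules and part-of-speech tags, not a fixed dictionary:

# Toy illustration: lemmatization maps inflected forms to one base form.
# Real lemmatizers use vocabulary rules and POS tags, not a fixed dict.
lemmas = {"eating": "eat", "ate": "eat", "eaten": "eat", "eats": "eat"}

word = "eating"
print(lemmas.get(word, word))  # prints "eat"; unknown words fall back to themselves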
Lemmatization using spaCy
spaCy is a popular NLP library in Python and provides elegant solutions for various NLP and ML-related tasks, including lemmatization. For this task, we can use spaCy’s built-in lemmatizer. Let’s see how we can achieve this:
import spacy
nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
doc = nlp("I have been working at this place for many years")
for token in doc:
    print(token.text, token.lemma_)
Let’s go over the code above.
Lines 1–2: We import the spacy library and load the en_core_web_md language model. The en_core_web_md model includes a comprehensive vocabulary, word vectors, POS tags, and syntactic dependencies based on web texts, making it suitable for general-purpose text processing. The disable=["ner", "parser"] parameter disables the named entity recognition (NER) and syntactic parser components of the spaCy pipeline since we are not using them, which keeps memory usage down (see the sketch after this list).
Line 3: We create a doc object by passing our example sentence to the model.
Lines 4–5: We declare a loop that iterates through all the tokens in the doc and prints each token’s text and lemma.
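As a quick sanity check (not part of the original example), we can print nlp.pipe_names to confirm that the NER and parser components were actually removed from the pipeline:

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])
# The disabled components should be missing from this list; for a
# typical spaCy v3 model, something like
# ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer'] remains.
print(nlp.pipe_names)

Note that the tagger stays enabled on purpose: spaCy’s rule-based lemmatizer uses part-of-speech tags to pick the right lemma.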
Batch processing with spaCy
Let’s look at another example where we’ll process a large text in smaller batches to avoid memory overload. We’ll use the en_core_web_md model and process text in chunks.
import spacy

# Function to process text in batches
def process_text_in_batches(text, batch_size=10):
    nlp = spacy.load("en_core_web_md")
    for i in range(0, len(text), batch_size):
        batch = text[i:i + batch_size]
        doc = nlp(batch)
        for token in doc:
            print(token.text, token.lemma_)

# Sample large text
text = ("I have been working at this place for many years. " * 1000)

# Process the text in batches of 100 characters
process_text_in_batches(text, batch_size=100)
Here’s the line-by-line explanation of the code example above:
Line 1: We import the spacy library.
Line 4: We define a function named process_text_in_batches.
Line 5: We load the en_core_web_md language model.
Line 6: We declare a loop that iterates through the text in steps of batch_size.
Line 7: We create a batch of the text with the specified batch size.
Line 8: We process the batch with the spaCy model to create a doc object.
Line 9: We declare a loop that iterates through all the tokens in the doc.
Line 10: We print the text and the lemma of each token.
Line 13: We create a sample large text by repeating a sentence 1,000 times.
Line 16: We call the process_text_in_batches function with the sample text and a batch size of 100 characters.
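One caveat: slicing by characters can cut a word in half at a batch boundary, which would distort its tokens and lemmas. As an alternative sketch, spaCy’s built-in nlp.pipe method accepts an iterable of texts and batches them internally, so we can feed it whole sentences instead of raw character slices:

import spacy

nlp = spacy.load("en_core_web_md", disable=["ner", "parser"])

# Whole sentences instead of character slices, so no word is split
# across a batch boundary.
texts = ["I have been working at this place for many years."] * 1000

# nlp.pipe streams the texts and processes them in internal batches;
# batch_size controls how many texts are buffered at a time.
for doc in nlp.pipe(texts, batch_size=100):
    for token in doc:
        print(token.text, token.lemma_)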