Lemmatization with tidytext

Explore how to perform lemmatization in R using the tidytext package combined with textstem. Understand the process of reducing words to their base forms, compare stemming and lemmatization results, and learn best practices for applying these techniques in text mining projects.

Lemmatization is a text preprocessing technique that reduces each word to its dictionary base form, known as the lemma (for example, "studies" becomes "study"). The tidytext package has no lemmatizer of its own, but its one-token-per-row format pairs naturally with textstem::lemmatize_words, which makes lemmatization a straightforward step in a tidytext pipeline.
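To make the difference between a stem and a lemma concrete, here is a minimal sketch on a handful of words. The word vector is made up for illustration, and the example assumes the default dictionary that ships with textstem (lexicon::hash_lemmas) rather than a corpus-specific one.

```r
library(textstem)
library(SnowballC)

# Illustrative words, not taken from the lesson's corpus.
words <- c("running", "studies", "better", "geese")

# Stemming strips suffixes by rule, so the result need not be a real word.
stems <- wordStem(words)

# Lemmatization looks each word up in a dictionary;
# with no dictionary argument, textstem uses lexicon::hash_lemmas.
lemmas <- lemmatize_words(words)

# Put the three forms side by side for comparison.
data.frame(word = words, stem = stems, lemma = lemmas)
```

Both functions are vectorized and return one result per input word, which is why they can be used directly inside mutate() on a token column.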

tidytext is an R package for text mining built on tidy data principles. Its core convention is a table with one token per row, so text can be manipulated with the same dplyr and tidyr verbs used for any other data frame.
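The one-token-per-row idea can be sketched on a tiny corpus. The two documents and the column names doc_id and text below are hypothetical, chosen only to mirror the shape of the data that readtext produces.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus, one row per document.
docs <- tibble(
  doc_id = c("a", "b"),
  text   = c("The cats were running.", "She studies geese.")
)

# unnest_tokens() splits the text column into words, lowercases them,
# strips punctuation, and returns one token per row.
tokens <- docs %>%
  unnest_tokens(word, text)

# The result is an ordinary data frame, so dplyr verbs apply directly.
tokens %>%
  count(doc_id)
```

Keeping tokens in a plain data frame is what lets the stemming and lemmatization steps be written as ordinary mutate() calls.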

Here’s code that stems and lemmatizes every token in a corpus and keeps the rows where the two results differ:

R
library(tidyverse)
library(tidytext)
library(readtext)
library(textstem)
library(SnowballC)
lemma_dictionary <- readtext(file = "data/mws*txt") %>%
  make_lemma_dictionary(engine = "hunspell")  # build a lemma lookup table from the corpus
lemmafied <- readtext("data/mws*txt") %>%
  unnest_tokens(word, text) %>%               # one lowercase token per row
  mutate(stem = wordStem(word)) %>%           # Porter stem from SnowballC
  mutate(lemm = lemmatize_words(word, dictionary = lemma_dictionary)) %>%  # dictionary lemma from textstem
  filter(stem != lemm) %>%                    # keep only rows where stem and lemma differ
  select(-doc_id)
print(lemmafied[, c("word", "stem", "lemm")], n = 100)
lemmafied[7, c("word", "stem", "lemm")]   # united vs unit vs unite
lemmafied[88, c("word", "stem", "lemm")]  # disadvantages vs advantage

Explaining the lemmatization code

The code above lemmatizes a tokenized corpus with tidytext and textstem, and stems the same tokens with SnowballC so the two results can be compared.

  • Lines 1–5: The library() function is used to load the required libraries (tidyverse, tidytext, ...