Lemmatization with tidytext

Learn how to perform text lemmatization using the tidytext package in R for improved text analysis.

We'll cover the following...

Lemmatization with tidytext
Explaining the lemmatization code
Pros and cons of lemma
Summarizing lemmatization with tidytext

Lemmatization with `tidytext`

Lemmatization is a text preprocessing technique that reduces each word to its dictionary form, known as the lemma (for example, "mice" becomes "mouse"). The tidytext package has no lemmatizer of its own, so this lesson pairs tidytext tokenization with textstem::lemmatize_words. Once the text is in a one-token-per-row table, lemmatizing it takes a single mutate() call.
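To make the difference between a stem and a lemma concrete, here is a minimal sketch. The three-word dictionary is a hypothetical stand-in built by hand for illustration; a real analysis would rely on a full dictionary.

```r
library(textstem)   # lemmatize_words()
library(SnowballC)  # wordStem()

words <- c("mice", "studies", "united")

# Hypothetical hand-built dictionary: first column token, second column lemma
toy_dictionary <- data.frame(
  token = c("mice", "studies", "united"),
  lemma = c("mouse", "study", "unite"),
  stringsAsFactors = FALSE
)

stems  <- wordStem(words)                                      # rule-based suffix stripping
lemmas <- lemmatize_words(words, dictionary = toy_dictionary)  # dictionary lookup

comparison <- data.frame(word = words, stem = stems, lemma = lemmas)
print(comparison)
```

A stem is whatever remains after suffix rules are applied, so it need not be a real word. A lemma is whatever the dictionary lists for that token, and tokens missing from the dictionary are returned unchanged.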

tidytext is an R package designed to perform text mining and analysis using the principles of tidy data. It provides functions and tools for manipulating and tidying text data, making it easier to work with.
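As a rough illustration of the tidy format, the sketch below tokenizes two made-up sentences into one token per row with unnest_tokens. The `docs` table and its contents are assumptions invented for this example.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus
docs <- tibble(
  doc_id = c("a", "b"),
  text   = c("The cats were running.", "A cat runs!")
)

# One lowercase token per row; punctuation is dropped by the default tokenizer
tokens <- docs %>%
  unnest_tokens(word, text)

print(tokens)

# Because the result is an ordinary table, dplyr verbs apply directly
word_counts <- tokens %>%
  count(word, sort = TRUE)
```

Each row of `tokens` keeps its `doc_id`, so later steps such as stemming or lemmatizing can be added as new columns without losing track of the source document.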

Here’s code to perform lemmatization with tidytext:

library(tidyverse)
library(tidytext)
library(readtext)
library(textstem)
library(SnowballC)

# Read the corpus once and reuse it for both the dictionary and the tokens
mws <- readtext("data/mws*txt")

# Build a lemma dictionary from the corpus text with the hunspell engine
lemma_dictionary <- make_lemma_dictionary(mws$text, engine = "hunspell")

lemmafied <- mws %>%
  unnest_tokens(word, text) %>%   # one lowercase token per row
  mutate(
    stem = wordStem(word),        # stem from SnowballC
    lemm = lemmatize_words(word, dictionary = lemma_dictionary)
  ) %>%
  filter(stem != lemm) %>%        # keep only rows where stem and lemma differ
  select(-doc_id)

print(lemmafied[, c("word", "stem", "lemm")], n = 100)
lemmafied[7, c("word", "stem", "lemm")]  # united vs unit vs unite
lemmafied[88, c("word", "stem", "lemm")] # disadvantages vs advantage
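
For readers without the corpus files at hand, the same pipeline can be tried on an in-memory sentence. This is only a sketch: `toy_corpus` is a hypothetical stand-in for the data files, and it swaps the corpus-built hunspell dictionary for the default dictionary that lemmatize_words uses when none is supplied, so results may differ from the lesson's output.

```r
library(dplyr)
library(tidytext)
library(textstem)
library(SnowballC)

# Hypothetical in-memory corpus standing in for the lesson's text files
toy_corpus <- tibble(
  doc_id = "toy.txt",
  text   = "The united nations listed several disadvantages."
)

differences <- toy_corpus %>%
  unnest_tokens(word, text) %>%
  mutate(
    stem = wordStem(word),
    lemm = lemmatize_words(word)  # default dictionary, not the hunspell one
  ) %>%
  filter(stem != lemm) %>%        # rows where the two methods disagree
  select(-doc_id)

print(differences)
```

Filtering on `stem != lemm` is what makes the comparison readable: words that both methods treat identically are dropped, leaving only the cases worth inspecting.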
