Stemming with tidytext
Learn how tidytext uses SnowballC and Hunspell to accomplish stemming.
tidytext relies on other packages for stemming: SnowballC and Hunspell.
Stemming with SnowballC
The tidytext package doesn’t have specific stemming functions and instead relies on SnowballC and standard tidyverse commands. 
The SnowballC package in R is an interface to the Snowball stemming library, which is a collection of algorithms for various languages. These algorithms were developed by Martin Porter and are widely used in natural language processing tasks.
SnowballC includes functions such as wordStem(), which takes a word as input and returns its stemmed form using the selected stemming algorithm. This function supports multiple languages, allowing us to choose the appropriate stemming algorithm based on the language of our text data.
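As a quick illustration of wordStem() on its own, here is a minimal sketch. The example words are made up for this demonstration, and the choice of the "english" stemmer is an assumption about the text being processed; when no language is given, wordStem() falls back to its default, "porter".

```r
library(SnowballC)

# Illustrative input words (not taken from the lesson's data files)
words <- c("running", "connections", "stemming")

# Stem each word with the English Snowball stemmer
stems <- wordStem(words, language = "english")
print(stems)
# [1] "run"     "connect" "stem"

# List the language names accepted by the `language` argument
print(getStemLanguages())
```

wordStem() is vectorized, so it can be applied to a whole column of tokens at once, which is what makes it convenient inside a mutate() call.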
Here’s R code demonstrating the use of SnowballC with tidytext:
library(tidyverse, quietly = TRUE)
library(tidytext, quietly = TRUE)
library(readtext, quietly = TRUE)
library(SnowballC, quietly = TRUE)
stemmed <- readtext(file = "data/mws*txt") %>%
  unnest_tokens(word, text) %>%
  filter(!grepl('[[:digit:]]', word)) %>%
  anti_join(stop_words, by = "word") %>%
  mutate(stem = wordStem(word))
print(stemmed, n = 100)
Here’s a breakdown of what each line does:
- Lines 1–4: These lines load the packages used in the code. Note the addition of SnowballC, ...