
Comparing R Packages for NLP

Understand the strengths and methodologies of leading R packages for NLP such as tm, tidytext, and quanteda. Learn how each package handles text data, tokenization, and preprocessing to help you select tools suited to your NLP projects and goals.

R packages for NLP

Modern programming languages have embraced extensibility, allowing developers to enhance and specialize a base language's functionality. R exemplifies this philosophy with packages that provide easy-to-use implementations of industry tools and techniques. For natural language processing, developers have produced R packages designed around several different philosophies and methodologies.

It’s critical to understand NLP concepts, and it’s equally important to choose R packages designed in a way that makes sense for our personal or team goals. There are many packages to choose from, and they all have different strengths. Fortunately, because all of these packages are based on the R programming language, they can be used together and often create compatible data structures.

The following is a brief overview of some of the more popular packages for NLP. The table reflects broad generalizations rather than the nuances of each package, and it doesn't show the dependencies of one package on another.

Features of R Packages for NLP

Each package supports a different subset of the following features: tokenization, stop words, stemming, lemmatization, POS tagging, sentiment analysis, tf-idf, and visualization. Several packages also depend on an external runtime or another package:

Package      Requirements
coreNLP      Java
koRpus
NLP
openNLP      Java
quanteda
SnowballC
spacyr       Python
text2vec
tidytext     dplyr
tm
udpipe

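As the Requirements column suggests, some entries are single-purpose utilities rather than full frameworks. SnowballC, for instance, does one thing, stemming, and tm relies on it when its stemming option is set. A minimal sketch:

```r
library(SnowballC)

# Porter stemming reduces inflected forms to a shared stem
stems <- wordStem(c("monsters", "walking", "walked"), language = "porter")
stems  # "monster" "walk" "walk"
```

Note that stems need not be dictionary words; the goal is only that related forms collapse to the same token.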
The following sections give brief examples of three popular packages used with base R: tm, tidytext, and quanteda.

Base R with tm

The tm package is one of the oldest and best-known NLP packages for R. Here’s an example of how tm converts a directory of documents into a corpus, and then a document-term matrix:

R
library(tm, quietly = TRUE)
newSimpleCorpus <- SimpleCorpus(DirSource(directory = "data/",
                                          pattern = "mws.+txt"))
DTmatrix <- DocumentTermMatrix(newSimpleCorpus,
                               control = list(tolower = TRUE,
                                              stopwords = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              stemming = TRUE
                                              )
                               )
inspect(DTmatrix)

In the code above, a corpus is created and then converted into a document-term matrix.

  • Lines 2–3: We create a simple corpus from the text files in the data directory.

  • Lines 4–11: We create a document-term matrix.

    • Line 5: This converts all words to lowercase.

    • Line 6: This removes stopwords.

    • Line 7: This removes punctuation.

    • Line 8: This removes numbers.

    • Line 9: This stems words.

  • Line 12: This line displays the resulting DTM.

The document-term matrix consists of one row per document and one column per token (or word). The corpus and the DTM are then used for a range of NLP tasks. We’ll discuss this in-depth in a later chapter.
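The same pipeline also works without files on disk. Here's a minimal sketch using an invented in-memory character vector (the two sentences below are placeholders, not the course data):

```r
library(tm, quietly = TRUE)

# Two tiny invented documents standing in for the files on disk
docs <- c("The monster walked 3 miles through the snow.",
          "Walking monsters are surprisingly common in novels.")

miniCorpus <- VCorpus(VectorSource(docs))

miniDTM <- DocumentTermMatrix(miniCorpus,
                              control = list(tolower = TRUE,
                                             stopwords = TRUE,
                                             removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             stemming = TRUE))

inspect(miniDTM)
```

Because stemming folds "walked" and "Walking" into the same term, and removeNumbers drops the "3", the matrix ends up with one row per document and one column per surviving stem.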

Base R with tidytext

The tidytext package is designed to perform text mining and natural language processing in keeping with the tidyverse and its concepts.

Note: The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Compare the above example of base R to the following example of NLP with tidytext:

R
library(tidytext, quietly = TRUE)
library(readtext, quietly = TRUE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
noNumbers <- readtext(file = "data/mws*txt") %>%
  unnest_tokens(word, text) %>%
  filter(is.na(suppressWarnings(as.numeric(word))))
print(noNumbers, n = 100)

The code above converts the text files in the data directory into a tidy data frame of tokens, one word per row.

  • Lines 1–3: Load tidytext, readtext, and dplyr.

  • Line 4: Reads every file in the data directory matching mws*txt into a data.frame.

  • Line 5: Tokenizes the text of each document, producing one word per row.

  • Line 6: Removes tokens that parse as numbers.

  • Line 7: Prints the first 100 rows.

Each word of the document has been placed in a single row, and all numbers have been removed. Note that tidytext doesn’t work with corpora—although it can convert a corpus produced with tm into a tidytext data object.
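That conversion is brief in practice. A minimal sketch, again with invented in-memory documents, using tidytext's tidy() method for tm corpora:

```r
library(tm, quietly = TRUE)
library(tidytext)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

# An invented two-document tm corpus standing in for the files on disk
tmCorpus <- VCorpus(VectorSource(c("Beware; for I am fearless.",
                                   "I am alone, and miserably alone.")))

# tidy() flattens the tm corpus into one row per document,
# with the document contents in a `text` column
tidyDocs <- tidy(tmCorpus)

# ...which can then be tokenized the usual tidytext way
tidyTokens <- tidyDocs %>%
  select(id, text) %>%
  unnest_tokens(word, text)
tidyTokens
```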

Base R with quanteda

The quanteda package is another popular framework for natural language processing. Many researchers consider it among the most powerful open-source NLP toolkits available.

The following code demonstrates how to use quanteda to remove numbers from a corpus:

R
# install.packages("quanteda")
# install.packages("readtext")
library(quanteda)
library(readtext)
filesToRead <- list.files(path = "data", pattern = "mws.+txt")
textDF <- readtext(paste0("data/", filesToRead),
                   docvarsfrom = "filenames")
quantCorpus <- corpus(textDF)
tokens(quantCorpus, remove_numbers = TRUE)

The code above reads, structures, and tokenizes the documents with quanteda.

  • Line 5: Lists the text files in the data directory.

  • Lines 6–7: Reads those files into a data.frame. docvarsfrom = "filenames" stores each file name as its document ID.

  • Line 8: Converts the data.frame into a quanteda corpus.

  • Line 9: Tokenizes the corpus, dropping numeric tokens.

The result shows a corpus of two documents; the printed output previews the first tokens of each document and notes how many more follow.
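In quanteda, a tokens object typically feeds a document-feature matrix (a dfm, quanteda's counterpart to tm's document-term matrix). A minimal sketch with invented in-memory documents rather than the files on disk:

```r
library(quanteda)

# Two invented documents in place of the files on disk
miniCorpus <- corpus(c(doc1 = "It was on a dreary night of November.",
                       doc2 = "I beheld the wretch on that same night."))

miniTokens <- tokens(miniCorpus,
                     remove_numbers = TRUE,
                     remove_punct = TRUE)

# One row per document, one column (feature) per token type
miniDFM <- dfm(miniTokens)
miniDFM
```

By default, dfm() lowercases features, so "November" and "november" would count as the same feature.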

CRAN task view

The tm, tidytext, udpipe, and quanteda packages are the most popular text-mining frameworks, but many other tools are available for a range of tasks. The R Project for Statistical Computing provides a task view with a frequently updated description of NLP tools for the R programming language.

The CRAN task view for natural language processing offers a comprehensive, curated description of the available tools and frameworks. It includes sections on the following topics:

  • Frameworks (tm, tidytext, and others)

  • Words (databases of words, keyword extraction, and string manipulation)

  • Semantics (tools for analyzing the meaning of text)

  • Pragmatics (NLP tools for research)

  • Corpora (exploration and visualization of corpora)

  • CRAN packages (a list of R packages related to text mining and NLP)

  • Related links (discussions and papers about NLP)

  • Other resources (ongoing development in the field of NLP)

You can find the CRAN task view for natural language processing on the CRAN website, listed under the name NaturalLanguageProcessing.