Comparing R Packages for NLP
Understand the strengths and methodologies of leading R packages for NLP such as tm, tidytext, and quanteda. Learn how each package handles text data, tokenization, and preprocessing to help you select tools suited to your NLP projects and goals.
R packages for NLP
Modern programming languages embrace extensibility, allowing a base language's functionality to be enhanced and specialized. R exemplifies this philosophy with packages that provide easy-to-use implementations of industry tools and techniques. For natural language processing, developers have built R packages around several different philosophies and methodologies.
It’s critical to understand NLP concepts, and it’s equally important to choose R packages designed in a way that makes sense for our personal or team goals. There are many packages to choose from, and they all have different strengths. Fortunately, because all of these packages are based on the R programming language, they can be used together and often create compatible data structures.
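As a small illustration of that interoperability, the sketch below builds a document-term matrix (a documents-by-terms count table) with tm and converts it to a tidy data frame with tidytext. The two sample sentences are made up for the example, and it assumes both packages are installed; tidy() is the converter that tidytext provides for tm objects.

```r
library(tm)
library(tidytext)

# Hypothetical sample documents (not taken from the lesson's data directory)
texts <- c("Cats chase mice.", "Dogs chase cats.")

# tm: build a corpus, then a document-term matrix
corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE))

# tidytext: convert the tm matrix to one row per (document, term) pair
tidy_terms <- tidy(dtm)
print(tidy_terms)
```

The resulting table has document, term, and count columns, so it can be passed straight to dplyr or ggplot2.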
The following is a brief overview of some of the more popular packages for NLP. The table captures only generalizations, not the nuances of each package, and it doesn't show the dependencies of one package on another.
Features of R Packages for NLP
Tokenizer | Stop Words | Stemming | Lemmatization | POS tagging | Sentiment Analysis | Tf-idf | Visualization | Requirements | |
| ✅ | ✅ | ✅ | Java | |||||
| ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ||
| ✅ | ||||||||
| ✅ | ✅ | Java | ||||||
| ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| ✅ | ||||||||
| ✅ | ✅ | ✅ | Python | |||||
| ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |||
| ✅ | ✅ | ✅ | ✅ | ✅ | dplyr | |||
| ✅ | ✅ | ✅ | ✅ | ✅ | ||||
| ✅ | ✅ | ✅ | ✅ | ✅ |
The following sections give brief examples of three popular packages used alongside base R: tm, tidytext, and quanteda.
Base R with tm
The tm package is one of the oldest and best-known NLP packages for R. Here’s an example of how tm converts a directory of documents into a corpus, and then a document-term matrix:
In the code above, a corpus is created and then converted into a document-term matrix (DTM).
Lines 3–4: We create a simple corpus from the text files in the data directory.
Lines 6–13: We create a document-term matrix.
Line 7: This converts all words to lowercase.
Line 8: This removes stopwords.
Line 9: This removes punctuation.
Line 10: This removes numbers.
Line 11: This stems words.
Line 15: This displays the resulting DTM.
The document-term matrix consists of one row per document and one column per token (or word). The corpus and the DTM are then used for a range of NLP tasks. We'll discuss this in depth in a later chapter.
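The steps described in this walkthrough can be approximated as follows. This is a minimal sketch rather than the lesson's exact listing, so the line numbers cited above won't line up with it. It assumes the SnowballC package is installed (tm relies on it for stemming) and writes two made-up text files to a temporary directory in place of the lesson's data directory.

```r
library(tm)

# Hypothetical sample data: two small text files in a temporary directory
data_dir <- file.path(tempdir(), "tm_data")
dir.create(data_dir, showWarnings = FALSE)
writeLines("The 3 cats were running quickly.", file.path(data_dir, "doc1.txt"))
writeLines("A dog runs, and 2 cats run!", file.path(data_dir, "doc2.txt"))

# Create a simple corpus from the text files in the directory
corpus <- SimpleCorpus(DirSource(data_dir, pattern = "\\.txt$"))

# Build the document-term matrix, preprocessing along the way
dtm <- DocumentTermMatrix(
  corpus,
  control = list(
    tolower = TRUE,            # convert all words to lowercase
    stopwords = TRUE,          # remove stopwords
    removePunctuation = TRUE,  # remove punctuation
    removeNumbers = TRUE,      # remove numbers
    stemming = TRUE            # stem words (uses SnowballC)
  )
)

# Display the resulting DTM
inspect(dtm)
```

Passing the preprocessing options through control keeps the cleanup and the matrix construction in a single call; the same steps could also be applied to the corpus beforehand with tm_map().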
Base R with tidytext
The tidytext package is designed to perform text mining and natural language processing in keeping with the tidyverse and its concepts.
Note: The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Compare the tm example above with the following tidytext example:
The code above converts the text files in the data directory into a table of tokens that works with tidyverse tools.
Line 5: Lists all text files in the data directory.
Line 6: Prepends the data directory to each file name to form its path.
Line 7: Reads each file, returning a data.frame.
Line 8: Tokenizes each file, resulting in a list of terms.
Line 9: Removes numbers.
Line 11: Prints the results.
Each word of each document has been placed in its own row, and all numbers have been removed. Note that tidytext doesn't operate on corpus objects directly, although it can convert a corpus produced with tm into a tidy data frame.
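A minimal sketch of this pipeline is shown below. It is not the lesson's exact listing, so the line numbers cited above won't match it. It assumes dplyr and tidytext are installed, uses two made-up text files in a temporary directory in place of the data directory, and treats any token made up entirely of digits as a number.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample data: two small text files in a temporary directory
data_dir <- file.path(tempdir(), "tidy_data")
dir.create(data_dir, showWarnings = FALSE)
writeLines("The 3 cats were running quickly.", file.path(data_dir, "doc1.txt"))
writeLines("A dog runs, and 2 cats run!", file.path(data_dir, "doc2.txt"))

# List the text files, including the directory in each path
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file into one row of a data.frame
docs <- data.frame(
  doc_id = basename(files),
  text = vapply(files, function(f) paste(readLines(f), collapse = " "),
                character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Tokenize into one word per row, then drop tokens that are only digits
tokens <- docs %>%
  unnest_tokens(word, text) %>%
  filter(!grepl("^[0-9]+$", word))

print(tokens)
```

By default, unnest_tokens() also lowercases the text and strips punctuation, which is why no separate steps for those appear here.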
Base R with quanteda
The quanteda package is another popular framework for natural language processing. Many researchers consider it one of the most capable open-source NLP tools available for R.
The following code demonstrates how to use quanteda to remove numbers from a corpus:
The code above reads the text files, builds a quanteda corpus from them, and tokenizes it while removing numbers.
Line 7: Lists the text files in the data directory.
Line 9: Reads those files into a data.frame. docvarsfrom saves the file names as the document ID.
Line 11: Converts the data.frame into a quanteda corpus.
Line 14: Tokenizes the corpus.
The result shows a corpus with two documents. For each document, it prints the first 12 tokens and indicates that many more follow.
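The same flow can be sketched as follows. This is an approximation, not the lesson's listing, so the cited line numbers won't match. The docvarsfrom argument mentioned above suggests the files were read with the readtext package; to stay self-contained, this sketch instead builds the data.frame with base R from two made-up text files in a temporary directory.

```r
library(quanteda)

# Hypothetical sample data: two small text files in a temporary directory
data_dir <- file.path(tempdir(), "quanteda_data")
dir.create(data_dir, showWarnings = FALSE)
writeLines("The 3 cats were running quickly.", file.path(data_dir, "doc1.txt"))
writeLines("A dog runs, and 2 cats run!", file.path(data_dir, "doc2.txt"))

# List the text files in the directory
files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)

# Read the files into a data.frame, using the file names as document IDs
docs <- data.frame(
  doc_id = basename(files),
  text = vapply(files, function(f) paste(readLines(f), collapse = " "),
                character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Convert the data.frame into a quanteda corpus
corp <- corpus(docs, docid_field = "doc_id", text_field = "text")

# Tokenize the corpus, dropping tokens that consist only of numbers
toks <- tokens(corp, remove_numbers = TRUE)
print(toks)
```

Punctuation is kept in this sketch; passing remove_punct = TRUE to tokens() would drop it as well.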
CRAN task view
The tm, tidytext, udpipe, and quanteda packages are the most popular text-mining frameworks, but many other tools are available for a range of tasks. The R Project for Statistical Computing provides a task view with a frequently updated description of NLP tools for the R programming language.
The CRAN task view for natural language processing offers a broad, curated description of the available tools and frameworks. It includes sections on the following topics:
Frameworks (tm, tidytext, and others)
Words (databases of words, keyword extraction, and string manipulation)
Semantics (tools for analyzing the meaning of text)
Pragmatics (NLP tools for research)
Corpora (exploration and visualization of corpora)
CRAN packages (a list of R packages related to text mining and NLP)
Related links (discussions and papers about NLP)
Other resources (ongoing development in the field of NLP)
You can find the CRAN NLP task view on the CRAN website, under the task view named NaturalLanguageProcessing.