R packages for NLP

Modern programming languages embrace extensibility, allowing developers to enhance and specialize the functionality of a base language. R exemplifies this philosophy with packages that provide easy-to-use implementations of industry tools and techniques. For natural language processing, developers have built R packages around several philosophies and methodologies.

It’s critical to understand NLP concepts, and it’s equally important to choose R packages designed in a way that makes sense for our personal or team goals. There are many packages to choose from, and they all have different strengths. Fortunately, because all of these packages are based on the R programming language, they can be used together and often create compatible data structures.

The feature table later in this section gives a brief overview of some of the more popular packages for NLP. It captures only generalizations, not the nuances of each package. In addition, it doesn’t reflect the dependencies of one package upon another.

Base R with `tm`

library(tm, quietly = TRUE) 
newSimpleCorpus <- SimpleCorpus(DirSource(directory = "data/",
                    pattern = "mws.+txt"))
DTmatrix <- DocumentTermMatrix(newSimpleCorpus, 
                               control = list(tolower = TRUE,
                                              stopwords = TRUE, 
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              stemming = TRUE
                               )
)
inspect(DTmatrix)

In the code above, a corpus is created and then converted into a document-term matrix (DTM).

`SimpleCorpus(DirSource(...))`: We create a simple corpus from the text files in the data directory whose names match the pattern `mws.+txt`.
`DocumentTermMatrix(...)`: We create a document-term matrix, using the following `control` options:
- `tolower = TRUE`: This converts all words to lowercase.
- `stopwords = TRUE`: This removes stopwords.
- `removePunctuation = TRUE`: This removes punctuation.
- `removeNumbers = TRUE`: This removes numbers.
- `stemming = TRUE`: This stems words.
`inspect(DTmatrix)`: This line displays the resulting DTM.

The document-term matrix consists of one row per document and one column per token (or word). The corpus and the DTM are then used for a range of NLP tasks. We’ll discuss this in depth in a later chapter.
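To make the row-and-column layout concrete, here is a minimal sketch that builds a DTM from two inline strings instead of files. The sample texts and the variable names are assumptions chosen for illustration, and no cleaning options are applied.

```r
library(tm)

# Hypothetical two-document corpus built from inline strings instead of files
docs <- c("the cat sat", "the dog sat down")
corpus <- VCorpus(VectorSource(docs))

# Build the DTM with default settings (no stopword removal or stemming)
dtm <- DocumentTermMatrix(corpus)

# Convert to a dense matrix: rows are documents, columns are terms
m <- as.matrix(dtm)

print(dim(m))        # number of documents, number of distinct terms
print(colnames(m))   # the terms that became columns
print(m[, "sat"])    # how often "sat" occurs in each document
```

A dense matrix is convenient for a toy example like this one; for a real corpus, the sparse DTM object is usually the better structure to keep working with.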

Base R with `tidytext`

The tidytext package is designed to perform text mining and natural language processing in keeping with the tidyverse and its concepts.

Note: The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Compare the above example of base R to the following example of NLP with tidytext:

This example converts the text files in the data directory to a list of tokens that are compliant with tidyverse. It works in the following steps:

- Lists all text files in the data directory.
- Adds the word “data” to the path of each file.
- Reads each file, returning a data.frame.
- Tokenizes each file, resulting in a list of terms.
- Removes numbers.
- Prints the results.
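The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the lesson's own listing: the temporary folder, the two sample files (`mws_a.txt` and `mws_b.txt`), and the variable names are hypothetical and exist only so the example runs on its own.

```r
library(dplyr)
library(tidytext)

# Hypothetical setup: write two tiny sample files into a temporary "data" folder
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
writeLines("It was 1816 when the story began.", file.path(data_dir, "mws_a.txt"))
writeLines("The 2 travellers walked for 10 miles.", file.path(data_dir, "mws_b.txt"))

# List the text files, then prepend the directory to each file name
files <- list.files(data_dir, pattern = "mws.+txt")
paths <- file.path(data_dir, files)

tokens <- lapply(paths, function(p) {
    # Read each file into a data.frame with one row per line of text
    data.frame(doc_id = basename(p), text = readLines(p), stringsAsFactors = FALSE)
  }) %>%
  bind_rows() %>%
  unnest_tokens(word, text) %>%        # one lowercase token per row
  filter(!grepl("^[0-9]+$", word))     # drop tokens that are purely numeric

print(tokens)
```

Because the result is an ordinary data frame with one token per row, the usual dplyr verbs such as `count()` or `group_by()` apply to it directly.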

Each word of the document has been placed in a single row, and all numbers have been removed. Note that tidytext doesn’t work with corpora, although it can convert a corpus produced with tm into a tidytext data object.
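As one illustration of the two packages working together, the following minimal sketch uses `tidy()` from tidytext on a tm object. Note that it tidies a document-term matrix built from a tm corpus rather than the corpus itself, which is a closely related route chosen here for simplicity. The inline sample texts are assumptions for illustration.

```r
library(tm)
library(tidytext)

# Hypothetical tm corpus built from two inline strings
corpus <- VCorpus(VectorSource(c("the cat sat", "the dog sat down")))
dtm <- DocumentTermMatrix(corpus)

# Convert the DTM into a tidy table: one row per document-term pair
# that has a non-zero count
tidy_dtm <- tidy(dtm)

print(tidy_dtm)
```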

Base R with `quanteda`

The quanteda package is another popular framework for natural language processing. Many researchers consider it one of the most capable open-source NLP tools available.

The following code demonstrates how to use quanteda to remove numbers from a corpus:

This example uses quanteda commands in the following steps:

- Lists the text files in the data directory.
- Reads those files into a data.frame. `docvarsfrom` saves the file names as the document ID.
- Converts the data.frame into a quanteda corpus.
- Tokenizes the corpus.
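A minimal sketch of the corpus and tokenization steps, under stated assumptions: it skips the file-reading step and starts from a named character vector, where the hypothetical names `mws_a.txt` and `mws_b.txt` stand in for the file names used as document IDs. The sample texts are also made up for illustration.

```r
library(quanteda)

# Hypothetical stand-in for the files in the data directory:
# a named character vector whose names act as document IDs
texts <- c(
  "mws_a.txt" = "It was 1816 when the story began.",
  "mws_b.txt" = "The 2 travellers walked for 10 miles."
)

# Build a quanteda corpus; document names are taken from the vector names
corp <- corpus(texts)

# Tokenize the corpus and drop tokens that consist only of numbers
toks <- tokens(corp, remove_numbers = TRUE)

print(toks)
```

Removing numbers at tokenization time keeps the corpus itself unchanged, so the original text remains available for other analyses.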

The result shows the corpus with two documents. For each document, it displays the first 12 tokens and indicates that many more follow.

CRAN task view

The tm, tidytext, udpipe, and quanteda packages are the most popular text-mining frameworks, but many other tools are available for a range of tasks. The R Project for Statistical Computing provides a task view with a frequently updated description of NLP tools for the R programming language.

The CRAN task view for natural language processing is a complete description of tools and frameworks for your use. It includes sections on the following topics:

Frameworks (tm, tidytext, and others)
Words (databases of words, keyword extraction, and string manipulation)
Semantics (tools for analyzing the meaning of text)
Pragmatics (NLP tools for research)
Corpora (exploration and visualization of corpora)
CRAN packages (a list of R packages related to text mining and NLP)
Related links (discussions and papers about NLP)
Other resources (ongoing development in the field of NLP)

You can find the CRAN NLP task view on the CRAN website, where it’s listed as NaturalLanguageProcessing.

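Features of R Packages for NLP

| Package | Tokenizer | Stop Words | Stemming | Lemmatization | POS tagging | Sentiment Analysis | Tf-idf | Visualization | Requirements |
|---|---|---|---|---|---|---|---|---|---|
| `coreNLP` | ✅ | | | | ✅ | ✅ | | | Java |
| `koRpus` | ✅ | ✅ | ✅ | ✅ | ✅ | | ✅ | ✅ | |
| `NLP` | ✅ | | | | | | | | |
| `openNLP` | ✅ | | | | ✅ | | | | Java |
| `quanteda` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| `SnowballC` | | | ✅ | | | | | | |
| `spacyr` | ✅ | | | ✅ | ✅ | | | | Python |
| `text2vec` | ✅ | ✅ | ✅ | ✅ | | | ✅ | ✅ | |
| `tidytext` | ✅ | ✅ | | | ✅ | ✅ | ✅ | | dplyr |
| `tm` | ✅ | ✅ | ✅ | | | | ✅ | ✅ | |
| `udpipe` | ✅ | | | ✅ | ✅ | ✅ | ✅ | | |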
