Using a Suitable Corpus Class

Explore how to choose and use suitable corpus classes such as SimpleCorpus, VCorpus, PCorpus, and DCorpus with the tm package in R. Understand their characteristics, memory management, and appropriate use cases to handle various text sources and sizes, improving your ability to manage and analyze textual data effectively.

Let’s take a closer look at the corpus classes available in the tm package and through its plug-in packages.

Corpus

Corpus is a convenient alias to create either a SimpleCorpus or a VCorpus, depending on the arguments provided. For example, SimpleCorpus can’t contain XML, so if we were to use Corpus with XML, Corpus would create a VCorpus. Here is an example of Corpus:

R
library(tm, quietly = TRUE)
docDir <- DirSource(directory = "data",
                    pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)
# show structure of the new corpus
str(newCorpus)

This is a simple example. At the top of the structure listing, a line names the object’s classes, and SimpleCorpus is among them. If the source had been anything other than a DataframeSource, DirSource, or VectorSource, Corpus would have created a VCorpus instead.
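To check the class without needing files on disk, here is a minimal sketch that builds corpora from an in-memory character vector. The sampleTexts vector and the variable names are made up for illustration, and the sketch assumes a recent tm release (0.7 or later), where Corpus acts as an alias in the way described above.

```r
library(tm, quietly = TRUE)

# Hypothetical sample texts standing in for the files in "data"
sampleTexts <- c("The first short document.",
                 "The second short document.")

# Corpus() with a VectorSource and the default reader yields a SimpleCorpus
simpleCorpus <- Corpus(VectorSource(sampleTexts))

# Calling VCorpus() directly requests a VCorpus from the same source
volatileCorpus <- VCorpus(VectorSource(sampleTexts))

# class() returns the class vector of each corpus
class(simpleCorpus)
class(volatileCorpus)

# inherits() gives a TRUE/FALSE check that is convenient in scripts
inherits(simpleCorpus, "SimpleCorpus")
inherits(volatileCorpus, "VCorpus")
```

Calling VCorpus directly, rather than relying on the alias, is one way to make the resulting class explicit when later code depends on it.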

Here is the Corpus command with all arguments defined:

R
newVCorpus <- Corpus(
  x = DirSource(directory = "data",
                pattern = "mws_.+txt"),
  # readPlain reads each file as plain text; it is the default reader for a DirSource
  readerControl = list(reader = readPlain,
                       language = "en")
)
  • x is a source object.

  • readerControl is a list of two components: reader and language.

    • The reader function constructs a text document from the files identified by x.

      • ...