Using a Suitable Corpus Class
Explore how to choose and use suitable corpus classes such as SimpleCorpus, VCorpus, PCorpus, and DCorpus with the tm package in R. Understand their characteristics, memory management, and appropriate use cases to handle various text sources and sizes, improving your ability to manage and analyze textual data effectively.
Let’s take a closer look at the corpus classes that ship with the tm package or are provided by its plug-in packages.
Corpus
Corpus is a convenient alias to create either a SimpleCorpus or a VCorpus, depending on the arguments provided. For example, SimpleCorpus can’t contain XML, so if we were to use Corpus with XML, Corpus would create a VCorpus. Here is an example of Corpus:
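As a minimal sketch of such an example, the code below builds a corpus from a character vector. The `docs` vector and the `corp` name are hypothetical sample values, not part of the original lesson, and the tm package is assumed to be installed.

```r
library(tm)

# Hypothetical sample documents (illustrative only)
docs <- c("The first document is short.",
          "The second document is a little longer.")

# VectorSource treats each element of the vector as one document
corp <- Corpus(VectorSource(docs))

# str() prints the structure; the class line at the top names the corpus class
str(corp)

# class() reports the same information more directly
class(corp)
```

Because the source is a VectorSource read with its default reader, `Corpus` should return a SimpleCorpus here.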
This is a simple example. At the top of the structure listing, the line that names the classes shows that this is a SimpleCorpus. If the source had been anything other than DataframeSource, DirSource, or VectorSource, this would have been a VCorpus.
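To illustrate the VCorpus case, here is a hedged sketch that uses URISource, one of the other source types in tm. It assumes the sample file `loremipsum.txt` that ships in the `texts` folder of the tm installation; the variable names are hypothetical.

```r
library(tm)

# Path to a sample text file bundled with tm, addressed as a file:// URI
path <- system.file("texts", "loremipsum.txt", package = "tm")
src <- URISource(sprintf("file://%s", path))

# URISource is not one of the three sources SimpleCorpus supports,
# so Corpus is expected to fall back to a VCorpus
corp_uri <- Corpus(src)
class(corp_uri)
```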
Here is the Corpus command with all arguments defined:
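The following sketch spells out both arguments explicitly. It assumes a VectorSource built from a hypothetical `docs` vector; `reader(src)` looks up the default reader for that source, and `"en"` is the default language.

```r
library(tm)

# Hypothetical sample documents (illustrative only)
docs <- c("The first document is short.",
          "The second document is a little longer.")
src <- VectorSource(docs)

# Both arguments written out instead of relying on their defaults
corp_full <- Corpus(
  x = src,
  readerControl = list(
    reader   = reader(src),  # default reader registered for this source
    language = "en"          # language of the documents
  )
)

class(corp_full)
```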
x is a source object. readerControl is a list of two components: reader and language. The reader function constructs a text document from the files identified by x....