Using the File Folder as Corpus

Learn how to load files and folders into a corpus, such as a SimpleCorpus, with the tm package.

The documentation for tm is nearly 60 pages long and immediately dives into the mechanics of NLP. Rather than trying to absorb the whole package in one go, let’s break it down into related, manageable components. The tm package covers these main topics:

  • Corpora and sources

  • Metadata

  • Preprocessing: Cleaning, stopwords, and stemming

  • Tokenizing: Words, n-grams, weighting

  • Statistics: Term frequency

  • Visualization

In this lesson, we’ll use Frankenstein as a base for our project. Our first task is to import text into a corpus.

VCorpus and SimpleCorpus

Natural language processing and text mining operate on a collection of documents, and this collection is called a corpus. Creating a corpus is the first step in natural language processing with tm. Documents are imported into a corpus with the Corpus family of functions, such as VCorpus() and SimpleCorpus(). Each function produces a different type of corpus.

There are two main versions:

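  • VCorpus (volatile corpus, held entirely in memory for the duration of the R session)

  • SimpleCorpus (similar to VCorpus, but optimized for the common case: it accepts only DirSource and VectorSource inputs and always reads them as plain text)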
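To make the difference between the two types concrete, here is a minimal sketch, assuming the tm package is installed. The two sample sentences and the variable names are placeholders chosen for illustration, not part of the lesson's data set. Both corpora are built from the same in-memory character vector via VectorSource().

```r
library(tm)

# Hypothetical sample texts; each element of the vector becomes one document.
texts <- c(
  doc1 = "It was on a dreary night of November.",
  doc2 = "The rain pattered dismally against the panes."
)

# Build both corpus types from the same source.
v_corpus <- VCorpus(VectorSource(texts))
s_corpus <- SimpleCorpus(VectorSource(texts))

# Inspect the class of each object and the number of documents it holds.
class(v_corpus)
class(s_corpus)
length(v_corpus)
length(s_corpus)
```

Both objects hold the same two documents; what differs is the class of the corpus and the set of sources and readers each one accepts.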

Here’s how to create a VCorpus:
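The following is a minimal sketch, assuming the tm package is installed. Because the lesson's Frankenstein files are not included here, the sketch first writes two short placeholder text files to a temporary folder; the folder name, file names, and variable names are hypothetical. It then reads that folder with DirSource(), which treats each matching file as one document.

```r
library(tm)

# Hypothetical stand-in for the lesson's data folder:
# two small text files written to a temporary directory.
corpus_dir <- file.path(tempdir(), "corpus_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(
  "You will rejoice to hear that no disaster has accompanied the commencement of an enterprise.",
  file.path(corpus_dir, "letter1.txt")
)
writeLines(
  "How slowly the time passes here, encompassed as I am by frost and snow!",
  file.path(corpus_dir, "letter2.txt")
)

# Point tm at the folder; every file matching the pattern is one document.
folder_source <- DirSource(corpus_dir, pattern = "\\.txt$", encoding = "UTF-8")

# Read the documents into an in-memory (volatile) corpus.
frank_vcorpus <- VCorpus(folder_source)

# Summarize the corpus and list the document names taken from the file names.
print(frank_vcorpus)
names(frank_vcorpus)
```

A DirSource is also one of the sources SimpleCorpus() accepts, so replacing VCorpus(folder_source) with SimpleCorpus(folder_source) in this sketch should produce a SimpleCorpus from the same folder.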
