Assigning Metadata

Learn to store metadata in a VCorpus in R for document management.

Storing metadata in the corpus

Metadata can be stored in two places: in the corpus itself or in one or more of its documents. The idea is simple, but in practice, it isn’t obvious which location to use or for what purpose. Let’s explore the structure of a VCorpus to get a better picture of how metadata is stored.
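As a minimal sketch of the two locations, the following builds a tiny VCorpus from two made-up sentences, then sets one tag on the corpus and one on a single document. The object name demoCorpus, the tag collection, and all values are illustrative assumptions, not part of the lesson's data.

```r
library(tm, quietly = TRUE)

# two made-up documents, used only for illustration
docs <- c("It was a dark night.", "The creature opened its eyes.")
demoCorpus <- VCorpus(VectorSource(docs))

# metadata stored in the corpus (hypothetical tag "collection")
meta(demoCorpus, tag = "collection", type = "corpus") <- "demo texts"

# metadata stored in one document
meta(demoCorpus[[1]], tag = "author") <- "Unknown"

# inspect both locations
meta(demoCorpus, type = "corpus")
meta(demoCorpus[[1]])
```

The type argument selects the corpus-level store, while indexing with [[1]] reaches the metadata of an individual document.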

To explore the metadata in a VCorpus, we first need to create a VCorpus with metadata. In each of the following examples, this setup will be hidden within the code, so let’s first understand what it does.

# create useful metadata
library(tm, quietly = TRUE)
library(readtext)
# import documents ----------------------------------
fileList <- list.files(path = "data",
                       pattern = "mws_.+txt",
                       full.names = TRUE)
# readtext returns a data.frame
aDataFrame <- readtext(fileList)
# extract metadata and add to aDataFrame ------------
# The extraction is accomplished with the stringr package
# install.packages("stringr")
library(stringr)
aDFtags <- str_match_all(string = aDataFrame$text,
                         pattern = "(Title:|Author:|Release Date:|Language:) (.+)\\R")
for (eachRow in seq_len(nrow(aDataFrame))) {
  for (eachListItem in seq_len(nrow(aDFtags[[eachRow]]))) {
    aDataFrame[eachRow, aDFtags[[eachRow]][eachListItem, 2]] <-
      aDFtags[[eachRow]][eachListItem, 3]
  }
}
# create the corpus -------------------------
newVCorpus <- VCorpus(DataframeSource(aDataFrame))
meta(newVCorpus)

Let’s break the above code into three main parts for better understanding: importing the documents (the # import documents section), extracting the metadata (the # extract metadata section), and finally, creating the corpus (the # create the corpus section).
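To see what the extraction step produces, here is a small sketch that applies the same pattern to a made-up header string instead of a file. The sample text and the names sampleText and tagMatrix are assumptions for illustration only.

```r
library(stringr)

# a made-up document header, not taken from the lesson's files
sampleText <- "Title: An Example Novel\nAuthor: A. N. Author\nLanguage: English\n"

# str_match_all returns a list with one character matrix per input string
tags <- str_match_all(string = sampleText,
                      pattern = "(Title:|Author:|Release Date:|Language:) (.+)\\R")
tagMatrix <- tags[[1]]

# column 1 holds the full match, column 2 the tag, column 3 the value
tagMatrix[, 2]
tagMatrix[, 3]
```

Because this sample has no Release Date: line, its matrix has three rows rather than four, so the number of rows can differ from one document to the next.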

Import the documents

At this point in the code, aDataFrame consists of just two columns: doc_id and text. There is no metadata available yet. We can confirm this by running the following code snippet:

# create useful metadata
library(tm, quietly = TRUE)
library(readtext)

fileList <- list.files(path = "data",
                       pattern = "mws_.+txt",
                       full.names = TRUE)
# readtext returns a data.frame
aDataFrame <- readtext(fileList)
aDataFrame
  • Lines 5–7: We assign the names of files starting with mws_ in the data directory. ...