Assigning Metadata
Learn how to store metadata in a VCorpus in R for document management.
Storing metadata in the corpus
Metadata can be stored in two places: in the corpus itself or in one or more of its documents. The idea is simple, but in practice it isn’t always obvious which location to use or for what purpose. Let’s explore the structure of a VCorpus to get a better picture of how metadata is stored.
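As a rough illustration of the two storage locations, here is a minimal sketch using the tm package. The two documents and the Author values are placeholders invented for this example and are not the lesson's data. It assumes that extra data frame columns passed through DataframeSource are kept with the corpus, while each document carries its own separate metadata.

```r
library(tm, quietly = TRUE)

# Placeholder documents (hypothetical, not the lesson's files)
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("First sample text.", "Second sample text."),
  Author = c("Author A", "Author B"),
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(docs))

# Metadata kept with the corpus: one row per document
meta(corp)

# Metadata kept inside a single document
meta(corp[[1]])
```

Comparing the two printed results shows that the Author column lives with the corpus, while fields such as the document's language live inside the document itself.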
To explore the metadata in a VCorpus, we first need to create a VCorpus with metadata. In each of the following examples, this process will be hidden within the code. Let’s understand what is happening.
# create useful metadata
library(tm, quietly = TRUE)
library(readtext)
# import documents ----------------------------------
fileList <- list.files(path = "data",
                       pattern = "mws_.+txt",
                       full.names = TRUE)

# readtext returns a data.frame
aDataFrame <- readtext(fileList)

# extract metadata and add to aDataFrame ------------
# The extraction is accomplished with the stringr package
# install.packages("stringr")
library(stringr)
aDFtags <- str_match_all(string = aDataFrame$text,
                         pattern = "(Title:|Author:|Release Date:|Language:) (.+)\\R")

for (eachRow in seq_len(nrow(aDataFrame))) {
  for (eachListItem in seq_len(nrow(aDFtags[[eachRow]]))) {
    aDataFrame[eachRow, aDFtags[[eachRow]][eachListItem, 2]] <-
      aDFtags[[eachRow]][eachListItem, 3]
  }
}

# create the corpus -------------------------
newVCorpus <- VCorpus(DataframeSource(aDataFrame))
meta(newVCorpus)
Let’s break the above code into three main parts for better understanding: importing documents (lines 5–11), extracting metadata (lines 13–24), and finally, creating the corpus (lines 27–28).
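To see what the extraction step produces, the sketch below applies the same pattern to a short in-memory string. The header text is hypothetical and only stands in for the beginning of one of the real files, which are not reproduced here.

```r
library(stringr)

# Hypothetical header text standing in for the start of one document
sampleText <- "Title: A Sample Title\nAuthor: A. Writer\nLanguage: English\n\nBody text."

tags <- str_match_all(
  string = sampleText,
  pattern = "(Title:|Author:|Release Date:|Language:) (.+)\\R"
)

# str_match_all returns one matrix per input string:
# column 1 is the full match, column 2 the tag, column 3 the value
tags[[1]][, 2:3]
```

Because the first capture group includes the colon, the tag in column 2 is "Title:" rather than "Title", so the columns added to the data frame in the loop carry that trailing colon in their names.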
Import the documents
At this point in the code, aDataFrame consists of just two columns: doc_id and text. There is no metadata available yet. We can confirm this by running the following code snippet:
# create useful metadata
library(tm, quietly = TRUE)
library(readtext)

fileList <- list.files(path = "data",
                       pattern = "mws_.+txt",
                       full.names = TRUE)
# readtext returns a data.frame
aDataFrame <- readtext(fileList)
aDataFrame
Lines 5–7: We assign to fileList the full paths of the files in the data directory whose names start with mws_. ...