Using a Suitable Source Type

Explore how to select and use various source types for text corpora in R using the tm package. Understand how to import documents from data frames, directories, URLs, vectors, XML files, and ZIP archives to build and analyze corpora for natural language processing tasks.

Introduction to sources

The tm package imports documents through special functions called sources. It ships with a set of general-purpose sources, and developers can add more through plug-ins. In this lesson, we’ll look at the sources included with tm.

The tm package provides getSources() to list the available sources. Run the following code to list the sources available in this copy of tm:

R
library(tm, quietly = TRUE)

getSources()
  • Line 3: getSources() returns the names of the available sources.
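Because the result of getSources() is an ordinary character vector of source names, it can be tested like any other vector. The following is a minimal sketch that checks for a source by name before relying on it; the variable name availableSources is only illustrative.

```r
library(tm, quietly = TRUE)

# Store the names of the sources this installation of tm provides
availableSources <- getSources()

# TRUE when the named source is available in this copy of tm
"DataframeSource" %in% availableSources
```

A check like this may be useful in scripts that run against different versions of tm.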

Let’s explore each of these sources in depth.

DataframeSource

A DataframeSource reads documents from a data.frame in which each row represents a document. The first column must be named “doc_id” and contain a unique string that identifies the document, such as a file name. The second column must be named “text” and contain the document’s contents. The following code creates a DataframeSource and then creates a corpus from that source:

R
library(tm, quietly = TRUE)
library(readtext)

DataDirectory <- "data/"
fileList <- dir(path = DataDirectory, pattern = "mws_.+txt")

# readtext returns a data.frame
aDataframe <- readtext(paste0(DataDirectory, fileList))

# This code confirms the doc_id is unique --------
if (nrow(aDataframe) == length(unique(aDataframe$doc_id))) {
  message("doc_id is unique")
} else {
  stop("doc_id is not unique")
}
aCorpus <- Corpus(DataframeSource(aDataframe))
summary(aCorpus)
  • Line 4: This line sets the DataDirectory variable to the string “data/”. It specifies the directory where the text files are located.

  • Line 5: This line creates a character vector fileList containing the names of files in DataDirectory that match the specified pattern. In this case, it looks for files that start with mws_ and end with .txt (such as mws_1.txt or mws_2.txt).

  • Line 8: This line uses the readtext() function from the readtext package to read the content of the files listed in fileList. The paste0() function concatenates DataDirectory with each file name to form the complete file paths. The readtext() function returns a data.frame with two columns:

    • doc_id (the identifier of the document)

    • text (the content of the text file)

  • Line 11: This line checks whether the number of rows in the aDataframe data frame is equal to the number of unique doc_id values. The nrow() function returns the number of rows, while length(unique(aDataframe$doc_id)) returns the number of unique doc_id values. ...
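The doc_id and text requirements can also be tried without any files on disk. The following is a minimal sketch that builds a small data frame inline; the document identifiers and contents are made up for illustration, and anyDuplicated() is used here as an alternative to the uniqueness check in the lesson code, not as a replacement for it.

```r
library(tm, quietly = TRUE)

# Hypothetical two-document data frame; identifiers and text are illustrative
docs <- data.frame(
  doc_id = c("doc_1", "doc_2"),
  text = c("The first sample document.", "The second sample document."),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when every doc_id is distinct
if (anyDuplicated(docs$doc_id) != 0) {
  stop("doc_id is not unique")
}

# Each row of the data frame becomes one document in the corpus
sampleCorpus <- Corpus(DataframeSource(docs))
length(sampleCorpus)
```

The column order matters in this sketch: doc_id comes first and text second, matching the layout a DataframeSource expects.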