Using a Suitable Source Type
Explore how to select and use various source types for text corpora in R using the tm package. Understand how to import documents from data frames, directories, URLs, vectors, XML files, and ZIP archives to build and analyze corpora for natural language processing tasks.
Introduction to sources
The tm package imports documents through special functions called sources. It ships with a set of sources for general-purpose work, and a developer can add more through plug-ins. In this lesson, we’ll look at the sources included with tm.
The tm package provides getSources() to produce a list of available sources. Run the following code to list the available sources in this copy of tm:
Line 3: getSources() provides a list of sources.
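As a minimal sketch of such a listing, assuming only that the tm package is installed, the call can be run on its own. The variable name availableSources is a hypothetical choice for illustration:

```r
# Assumes tm is installed, e.g. via install.packages("tm")
library(tm)

# getSources() returns a character vector naming the source
# constructors that this copy of tm provides
availableSources <- getSources()
print(availableSources)
```

Each name in the result is a function that can be called to build a source object of that type.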
Let’s explore each of these sources in depth.
DataframeSource
A DataframeSource wraps a data.frame in which each row represents a document. The first column must be named “doc_id” and contain a unique string that identifies the document, such as a file name. The second column must be named “text” and contain the document’s contents. The following code creates a DataframeSource and then creates a corpus from that source:
Line 4: This line sets the DataDirectory variable to the string "data/". It specifies the directory where the text files are located.
Line 5: This line creates a character vector fileList containing the names of files in DataDirectory that match the specified pattern. In this case, it looks for files that start with mws_ and end with .txt (such as mws_1.txt or mws_2.txt).
Line 8: This line uses the readtext() function from the readtext package to read the text content of the files specified in fileList. The readtext() function returns a data.frame with two columns: text (the content of the text file) and doc_id (the identifier of the document). The paste0() function concatenates DataDirectory with the file names to form the complete paths to the files.
Line 11: This line checks whether the number of rows in the aDataframe data frame is equal to the number of unique doc_id values. The nrow() function returns the number of rows, while length(unique(aDataframe$doc_id)) returns the number of unique doc_id values. ...
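The column requirements and the uniqueness check can also be tried without any files on disk. The sketch below is not the lesson's listing: it assumes a made-up two-row data frame in place of the output of readtext(), and it uses VCorpus() as one possible corpus constructor. The document names, texts, and the variables aSource and aCorpus are hypothetical.

```r
# Assumes tm is installed, e.g. via install.packages("tm")
library(tm)

# Hypothetical stand-in for the data frame that readtext() would return:
# doc_id must be the first column and text the second
aDataframe <- data.frame(
  doc_id = c("doc_1.txt", "doc_2.txt"),
  text = c("The first example document.",
           "A second, slightly longer example document."),
  stringsAsFactors = FALSE
)

# Same check as described for the lesson's code: one row per unique doc_id
stopifnot(nrow(aDataframe) == length(unique(aDataframe$doc_id)))

# Wrap the data frame in a source, then build a corpus from that source
aSource <- DataframeSource(aDataframe)
aCorpus <- VCorpus(aSource)

# Inspect the result: number of documents, and the first document's
# identifier and contents
print(length(aCorpus))
print(meta(aCorpus[[1]], "id"))
print(content(aCorpus[[1]]))
```

Keeping the uniqueness check ahead of DataframeSource() means a duplicated identifier is caught before the corpus is built.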