Why use document-term matrices?

The following code lists the tokens and their frequencies:

R


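library(tm, quietly = TRUE)

# Display the ten most frequent tokens ------------------
# Assumes shelleyText was loaded earlier in the lesson
shelleyText |>
  removePunctuation() |>
  removeWords(stopwords("english")) |>
  removeWords(c("I")) |>   # stopwords() entries are lowercase, so capital "I" is removed separately
  removeNumbers() |>
  stripWhitespace() |>
  Boost_tokenizer() |>
  vapply(paste, "", collapse = " ") |>
  table() |>
  sort(decreasing = TRUE) |>
  head(n = 10)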

R

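library(tm, quietly = TRUE)

# Read every file in "data" whose name matches the pattern
docDir <- DirSource(directory = "data", pattern = "mws_.+txt")
newCorpus <- Corpus(docDir)

# Build the document-term matrix: one row per document, one column per term.
# No separate whitespace option is needed; the tokenizer already splits on whitespace.
DTmatrix <- DocumentTermMatrix(newCorpus,
                               control = list(tolower = TRUE,
                                              stopwords = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              tokenize = "Boost"))
inspect(DTmatrix)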
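
Once the texts are in a document-term matrix, token frequencies can be compared across documents instead of being listed for a single text. The following is a minimal sketch of that idea, not the lesson's own code: it uses two made-up sentences in place of the files in the data directory, and the names toyTexts, toyCorpus, toyDTM, toyMatrix, and termTotals are hypothetical.

```r
library(tm, quietly = TRUE)

# Hypothetical in-memory texts standing in for the files in "data"
toyTexts <- c("The monster fled. The monster wept.",
              "The creator fled the monster.")
toyCorpus <- VCorpus(VectorSource(toyTexts))

# Same kind of control list as above, applied to the toy corpus
toyDTM <- DocumentTermMatrix(toyCorpus,
                             control = list(tolower = TRUE,
                                            stopwords = TRUE,
                                            removePunctuation = TRUE))

# Dense view: rows are documents, columns are terms
toyMatrix <- as.matrix(toyDTM)
print(toyMatrix)

# Total frequency of each term across all documents
termTotals <- sort(colSums(toyMatrix), decreasing = TRUE)
print(termTotals)
```

Each row of toyMatrix holds the counts for one document, so two documents can be compared column by column, while colSums() recovers the overall frequencies that the single-text pipeline produced. Converting to a dense matrix is only practical for small corpora; for larger ones, tm functions such as findFreqTerms() work on the sparse matrix directly.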

Analyzing Textual Comparisons with Document-Term Matrices
