
Implement tf-idf with tm

Explore how to implement term frequency-inverse document frequency (tf-idf) using the tm package in R. Understand how tf-idf helps identify the importance of terms within documents, rank documents by relevance, and distinguish key topics by analyzing weighted term frequencies.

Earlier, we discussed tf-idf, or term frequency-inverse document frequency. It measures how important a word (or token) is to one document relative to the rest of the corpus. With it, we can select a token and infer which document it most likely came from.
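As a reminder, one common way to write the weight is sketched below. The symbols are our own notation rather than anything defined in this lesson: n_{t,d} is the number of times term t occurs in document d, and N is the number of documents. The sketch assumes the normalized, base-2 variant that tm's weightTfIdf applies by default.

```latex
\mathrm{tfidf}_{t,d}
  = \underbrace{\frac{n_{t,d}}{\sum_{k} n_{k,d}}}_{\text{term frequency}}
    \times
    \underbrace{\log_2 \frac{N}{\lvert \{\, d' : n_{t,d'} > 0 \,\} \rvert}}_{\text{inverse document frequency}}
```

A term that appears in every document has an inverse document frequency of log2(1) = 0, so its weight is zero no matter how often it occurs.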

Let’s learn how to create tf-idf weights and then how to use them.

Calculating tf-idf with tm

Different packages calculate tf-idf in different ways. In the tm package, the weighting is applied when the document-term matrix (DTM) is created.
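To make the weighting less of a black box, here is a minimal sketch that applies weightTfIdf to a count-based DTM and then recomputes the same numbers by hand. The `docs` vector and `toyCorpus` are made-up illustration data, not this lesson's corpus, and the manual calculation assumes tm's default settings (term counts divided by document length, base-2 logarithm for the idf).

```r
library(tm, quietly = TRUE)

# Hypothetical three-document toy corpus, used only for illustration
docs <- c("the garden has roses",
          "the garden has tulips",
          "the kitchen has knives")
toyCorpus <- VCorpus(VectorSource(docs))

# Start from raw term counts, then apply tm's tf-idf weighting
counts <- DocumentTermMatrix(toyCorpus)
tfidf  <- weightTfIdf(counts, normalize = TRUE)

# Recompute the weights by hand from the count matrix
m   <- as.matrix(counts)
tf  <- m / rowSums(m)                      # share of each document's tokens
idf <- log2(nrow(m) / colSums(m > 0))      # rarer terms get a larger idf
manual <- t(t(tf) * idf)                   # scale each term column by its idf

# Both approaches should agree; "the" and "has" occur in every document,
# so their idf is log2(3 / 3) = 0 and their weights vanish
print(round(as.matrix(tfidf), 3))
print(all.equal(unname(as.matrix(tfidf)), unname(manual)))
```

Building the counts first and weighting them afterwards is equivalent to passing weighting = weightTfIdf in the control list; it is split here only so the raw counts remain available for the manual check.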

Here is code to illustrate the creation of tf-idf: ...

R
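library(tm, quietly = TRUE)

# compareText is assumed to be a data frame defined earlier, with the
# doc_id and text columns that DataframeSource() expects.
newCorpus <- VCorpus(DataframeSource(compareText))

# weighting = weightTfIdf stores tf-idf scores in the DTM instead of raw counts.
DTmatrix <- DocumentTermMatrix(newCorpus,
  control = list(tolower = TRUE,
                 #stopwords = TRUE,
                 # stripWhitespace() is a tm_map() transformation, not a
                 # control option, so it is not listed here.
                 removePunctuation = TRUE,
                 removeNumbers = TRUE,
                 weighting = weightTfIdf,
                 #dictionary = c("garden"),
                 tokenize = "Boost"
  )
)
inspect(DTmatrix)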

When we run this ...