Searching for Matching Documents with tf-idf
Explore how tf-idf identifies the most relevant documents by weighing how often a term appears in a document against how rare that term is across the collection. Understand how to compute and apply tf-idf using the tm package in R to rank documents against a query and improve natural language processing tasks.
Playing a game with documents
There is a common children’s game called “I Spy.” A group sits in a circle, and the leader says, “I spy, with my little eye, something blue.” Everyone else then tries to guess what the leader is looking at. Is it the blue telephone? Or perhaps the blue couch?
Natural language processing is often similar to this game. Given a query, which may be a single word or a whole document, we have to find the best-matching document in a collection. Internet search and spam filtering both rely on this kind of matching.
There are many strategies for this type of search. One of the most common is called term frequency-inverse document frequency, or tf-idf.
Note: TF–IDF, TF*IDF, TFIDF, and Tf–idf all refer to the same concept: term frequency-inverse document frequency. We can use any of these forms interchangeably. ...
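To make the matching idea concrete, here is a minimal sketch in base R that scores three toy documents against a query. It is an illustration under stated assumptions, not the lesson's own implementation: the documents, the `score_query` helper, the base-2 logarithm, and the choice to divide term counts by document length are all picked for this example. The tm package offers a `weightTfIdf` weighting for document-term matrices, but this sketch avoids it so that it runs without any packages installed.

```r
# Toy collection: three short documents (hypothetical data for illustration).
docs <- c(
  phone = "the blue telephone rang",
  couch = "the blue couch is soft",
  car   = "the red car is fast"
)

# Split each document into lowercase words and keep the document names.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)
vocab <- sort(unique(unlist(tokens)))

# Term frequency: count of each term divided by the document's length.
# Rows are documents, columns are vocabulary terms.
tf <- t(vapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = vocab))) / length(w)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log2(N / number of documents containing the term).
# A term that appears in every document gets an idf of 0.
doc_freq <- colSums(tf > 0)
idf <- log2(length(docs) / doc_freq)

# tf-idf: scale each term's column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

# Hypothetical helper: score a query by summing, per document,
# the tf-idf weights of the query terms found in the vocabulary.
score_query <- function(query) {
  terms <- intersect(strsplit(tolower(query), "\\s+")[[1]], vocab)
  sort(rowSums(tfidf[, terms, drop = FALSE]), decreasing = TRUE)
}

score_query("blue couch")
```

In this toy collection, "the" appears in every document, so its weight is zero everywhere and it contributes nothing to the ranking. The word "couch" appears in only one document, so it counts for more than "blue," which appears in two. That is why the couch document ranks first for the query "blue couch."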