Performing Natural Language Processing with R

Searching for Matching Documents with tf-idf

Learn about tf-idf and how it is calculated and used in information retrieval, search engines, and other NLP applications.

We'll cover the following...

Playing a game with documents
Understanding tf-idf
Summary

Playing a game with documents

There is a common children’s game called “I Spy.” A group sits in a circle, and the leader says, “I spy, with my little eye, something blue.” Everyone else then tries to guess what the leader is looking at. Is it the blue telephone? Or perhaps the blue couch?

Natural language processing is often similar to this game. Given a query, such as a document or a word, we have to determine the best-matching document from a list of documents. This is the core task behind an internet search, and spam filtering relies on closely related techniques.

There are many strategies for this type of search. One of the most common is term frequency-inverse document frequency, or tf-idf.
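To make the matching game concrete, here is a minimal sketch in base R that scores three toy documents against a single query word. The documents, the variable names, and the particular weighting (relative term frequency multiplied by the natural log of the number of documents over the number of documents containing the term) are assumptions chosen for illustration. They show one common tf-idf variant, not necessarily the exact formula or packages this course uses.

```r
# Toy corpus: three hypothetical one-line documents (illustrative only).
docs <- c(
  doc1 = "the blue telephone rings",
  doc2 = "the blue couch is soft",
  doc3 = "the leader spies something"
)

# Lowercase and split on whitespace to get one word vector per document.
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct word in the corpus.
vocab <- sort(unique(unlist(tokens)))

# Term frequency: word count divided by document length (documents in rows).
tf <- t(vapply(tokens, function(words) {
  as.numeric(table(factor(words, levels = vocab))) / length(words)
}, numeric(length(vocab))))
rownames(tf) <- names(docs)
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of documents containing the term).
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# tf-idf: scale each term's column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

# "I spy ... a couch": rank the documents by their score for the query word.
ranking <- sort(tfidf[, "couch"], decreasing = TRUE)
best_match <- names(ranking)[1]
print(best_match)  # "doc2"
```

In this sketch, a word such as "the" appears in every document, so its idf is log(3/3) = 0 and it contributes nothing to any score. A word such as "couch" appears in only one document, so it points strongly to that document.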

Note: TF–IDF, TF*IDF, TFIDF, and Tf–idf all refer to the same concept, which is term frequency-inverse document frequency. We can use any of these forms interchangeably. ...
