What is TF-IDF?
Explore the TF-IDF technique to measure the importance of words within documents relative to a corpus. Understand how term frequency and inverse document frequency combine to highlight distinctive terms. This lesson helps you grasp TF-IDF's calculation, strengths, limitations, and common applications in text analysis, search engines, and classification using R and Python.
As data scientists working with text, we often need to move beyond simply counting words. Raw word counts tell us how frequently a term appears, but they do not tell us how meaningful it is. A word like “the” might appear hundreds of times across every document in our corpus, yet carry no discriminative value whatsoever. What we really need is a way to quantify not just frequency, but relevance.
This is exactly the problem TF-IDF was designed to solve.
What is TF-IDF?
TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic that reflects how important a word is to a specific document relative to an entire corpus. The more often a word appears in a given document, and the more rarely it appears across the other documents, the higher its TF-IDF score and the more distinctive it is considered to be.
TF-IDF was originally developed in the 1970s for information retrieval and has since become one of the most widely used techniques in natural language processing. It sits at the foundation of many search engines, text classifiers, and recommendation systems because it strikes a practical balance between simplicity and effectiveness. Unlike more complex representations, TF-IDF requires no model training, no labeled data, and no deep learning infrastructure. It is purely statistical, yet surprisingly powerful.
TF-IDF formula
TF-IDF is the product of two separate scores: TF and IDF. Understanding each one individually makes the combined formula much easier to reason about.
Term frequency (TF)
Term frequency measures how often a word appears in a document. The intuition is straightforward. If a word appears frequently in a document, it is likely important to that document. However, to account for documents of different lengths, we normalize by the total number of words in the document.
In formula form:

TF(t, d) = (number of times t appears in d) / (total number of words in d)

For example, if the word “neural” appears 5 times in a document that contains 100 words, its TF score is 5 / 100 = 0.05.
Inverse document frequency (IDF)
Term frequency alone has a significant weakness. A word like “the” might appear very frequently in a single document, giving it a high TF score, but it also appears in every other document in the corpus. It is not a useful signal for distinguishing one document from another.
Inverse document frequency corrects for this by penalizing words that appear across many documents and rewarding words that appear in only a few:

IDF(t) = log(N / df(t))

Here, N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t. (This lesson uses base-10 logarithms.)

The logarithm is used to dampen the effect of the IDF score for very rare words. Without it, a word that appears in only 1 out of 1,000,000 documents would receive a disproportionately large score compared to a word that appears in 10 documents.
TF-IDF score
The final score is the product of the two:

TF-IDF(t, d) = TF(t, d) × IDF(t)
A high TF-IDF score means the word appears frequently in a specific document but rarely across the corpus, making it a strong signal for what that document is about. A low score means the word is either rare within the document, common across the corpus, or both.
Example
Suppose we have a corpus of three documents:
Document 1: “Learn Python on Educative”
Document 2: “Master Python exercises on Educative”
Document 3: “Data Science tutorials on Educative”
The word “on” appears in all three documents. Its IDF score will be log(3 / 3) = log(1) = 0.
It contributes nothing to any document’s TF-IDF representation, regardless of how often it appears. This is the IDF penalty working exactly as intended.
The word “Python” appears only in Documents 1 and 2. Its TF is 0.25 in Document 1 and 0.2 in Document 2. Its IDF score is log(3 / 2) ≈ 0.176,
giving it a moderate TF-IDF score of approximately 0.25 × 0.176 ≈ 0.044 in Document 1 and 0.2 × 0.176 ≈ 0.035 in Document 2. This shows that “Python” contributes moderately to identifying these documents but is not unique to a single one.
The word “Data” appears only in Document 3. Its TF is 0.2, and its IDF score is log(3 / 1) ≈ 0.477,
producing a TF-IDF score of approximately 0.2 × 0.477 ≈ 0.095. This makes it a strong identifier for Document 3.
This demonstrates TF-IDF in action: it highlights the words that make each document unique while suppressing commonly shared words.
What is the difference between TF and IDF?
TF is a local, per-document measure: it tells us how prominent a word is within a single document. IDF is a global, corpus-level measure: it tells us how rare that word is across all documents. TF rewards frequency; IDF penalizes ubiquity. Their product is high only when a word is both frequent in a document and rare in the corpus.
Implementing TF-IDF in Python/R
Using the same three-document corpus from the example, we can compute TF-IDF programmatically.
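A minimal Python sketch, using only the standard library and the exact unsmoothed, base-10 formulation from this lesson (library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization, so their numbers would differ slightly):

```python
import math

# The three-document corpus from the example above.
corpus = [
    "Learn Python on Educative",
    "Master Python exercises on Educative",
    "Data Science tutorials on Educative",
]

def tf_idf(corpus):
    """Compute TF-IDF per document: TF(t, d) * log10(N / df(t))."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # df(t): number of documents containing each term.
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    scores = []
    for tokens in docs:
        doc_scores = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)   # term frequency
            idf = math.log10(n_docs / df[term])     # inverse document frequency
            doc_scores[term] = tf * idf
        scores.append(doc_scores)
    return scores

scores = tf_idf(corpus)
print(round(scores[0]["python"], 3))  # ≈ 0.044, as computed by hand above
print(round(scores[2]["data"], 3))    # ≈ 0.095
print(scores[0]["on"])                # 0.0: "on" appears in every document
```

The output matches the hand calculations from the example: “Python” scores about 0.044 in Document 1, “Data” about 0.095 in Document 3, and “on” is zeroed out by its IDF of log(3/3) = 0.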
Advantages of TF-IDF
TF-IDF is a simple yet effective baseline for text representation that has stood the test of time across many NLP applications.
Simple to implement: TF-IDF requires no model training and is computationally inexpensive, making it a reliable first step for any text analysis task. It can be applied to a corpus of any size with minimal setup.
Handles common words automatically: The IDF component naturally down-weights stop words without needing an explicit stop word list, though combining TF-IDF with stop word removal can further improve results.
Interpretable: Unlike dense word embeddings, TF-IDF scores are directly interpretable. A higher score simply means a more distinctive word within that document relative to the corpus.
Works well for keyword extraction: TF-IDF is one of the most effective methods for identifying the key terms that characterize a document, making it a go-to technique for summarization and indexing tasks.
Language agnostic: Because TF-IDF operates purely on token counts, it works across any language without requiring language-specific resources or preprocessing tools.
Disadvantages of TF-IDF
However, TF-IDF has real limitations that become apparent as text tasks grow more complex.
Ignores word order and context: TF-IDF treats a document as a bag of words. It has no understanding of grammar, sentence structure, or the meaning behind word combinations. The sentences “the dog bit the man” and “the man bit the dog” would produce identical TF-IDF representations.
Struggles with synonyms: Two different words with the same meaning, such as “car” and “automobile”, are treated as completely unrelated terms and assigned independent scores.
Sparse representation: For large vocabularies, TF-IDF produces very high-dimensional sparse matrices that can be memory-intensive and slow to process in downstream models.
No semantic understanding: Unlike modern word embeddings such as Word2Vec or BERT, TF-IDF captures no semantic relationships between words. It cannot understand that “king” and “queen” are related concepts.
Sensitive to corpus size: IDF scores are computed relative to the corpus provided. A small or unrepresentative corpus can produce misleading scores that do not generalize well to new documents.
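The bag-of-words limitation above is easy to verify directly: both sentences reduce to the same multiset of token counts, so any count-based representation, TF-IDF included, cannot tell them apart. A quick check in Python:

```python
from collections import Counter

# Both sentences contain exactly the same tokens with the same counts,
# so their bag-of-words (and hence TF-IDF) representations are identical.
a = Counter("the dog bit the man".split())
b = Counter("the man bit the dog".split())
print(a == b)  # True
```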
Applications of TF-IDF
TF-IDF has remained a foundational technique across many real-world text processing tasks and continues to be used both as a standalone tool and as a preprocessing step for more advanced models.
Search engines: TF-IDF is used to rank documents by relevance to a search query by identifying which documents contain the most distinctive occurrences of the query terms. Early versions of Google's ranking algorithm incorporated TF-IDF as a core component.
Text classification: TF-IDF vectors are commonly used as input features for machine learning classifiers in tasks like spam detection, sentiment analysis, and topic categorization. When combined with models like logistic regression or support vector machines, TF-IDF representations often produce competitive results.
Keyword extraction: By ranking words within a document by their TF-IDF score, we can automatically identify the most representative keywords without any supervision or labeled data.
Document similarity: Comparing TF-IDF vectors across documents using cosine similarity allows us to measure how similar two pieces of text are. This is useful in recommendation systems, plagiarism detection, and document clustering.
Information retrieval: TF-IDF is widely used in building document retrieval systems where the goal is to return the most relevant documents from a large collection in response to a user query.
Conclusion
TF-IDF remains one of the most practical and interpretable tools in the NLP toolkit. By combining term frequency with inverse document frequency, it gives us a principled way to identify what makes each document distinctive within a corpus. While modern approaches like word embeddings and transformer models have pushed the boundaries of what is possible with text data, TF-IDF continues to serve as a strong, lightweight baseline that is easy to implement, easy to interpret, and effective across a wide range of tasks. Understanding it deeply is an essential stepping stone toward more advanced text representation techniques and a skill that every data scientist working with text should have in their toolkit.