What is the gensim.models.TfidfModel() function?
Gensim is an open-source Python library widely used in natural language processing (NLP) tasks like topic modeling and document similarity analysis. The TfidfModel is a fundamental component of Gensim; its name stands for Term Frequency-Inverse Document Frequency (TF-IDF).
The gensim.models.TfidfModel() function
The TfidfModel transforms a bag-of-words (BoW) representation of a document into a more meaningful and informative numerical representation.
It assigns a weight to each word in a document based on its frequency, indicating its importance within a specific document (Term Frequency) and its overall significance across the entire corpus of documents (Inverse Document Frequency).
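To make the weighting idea concrete, here is a minimal sketch that computes TF-IDF by hand for toy data, using the common textbook formulation tf(t, d) * log2(N / df(t)). It is an illustration of the concept, not a reproduction of TfidfModel's exact output, since Gensim's weighting functions are configurable (its default also uses a base-2 logarithm).

import math

# Toy documents, already tokenized (made up for illustration)
docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["machine", "translation", "is", "hard"],
]

N = len(docs)  # total number of documents in the corpus

def doc_freq(term):
    # Number of documents that contain the term at least once
    return sum(1 for d in docs if term in d)

def tfidf(term, doc):
    tf = doc.count(term)                 # term frequency in this document
    idf = math.log2(N / doc_freq(term))  # inverse document frequency
    return tf * idf

print(tfidf("machine", docs[0]))  # appears in 2 of 3 documents -> modest weight
print(tfidf("is", docs[0]))       # appears in every document -> weight 0

Words that occur in every document get an IDF of zero, which is exactly why TF-IDF treats them as uninformative.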
Syntax
The syntax to create a TfidfModel is given below:
tfidf_model = TfidfModel(corpus, normalize=True)
corpus is a required parameter: the bag-of-words (BoW) representation of the documents.
normalize is an optional parameter; set it to True to normalize the TF-IDF scores.
Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).
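The constructor also supports a few other convenient usage patterns. The sketch below, based on the Gensim API (double-check the documentation for your installed version), uses made-up toy data to show fitting the model directly from a Dictionary and transforming a single new document:

from gensim import corpora
from gensim.models import TfidfModel

# Hypothetical toy data for illustration
texts = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Fit from a BoW corpus, as in the syntax above...
tfidf_from_corpus = TfidfModel(bow_corpus, normalize=True)

# ...or fit directly from the dictionary's document-frequency statistics
tfidf_from_dict = TfidfModel(dictionary=dictionary, normalize=True)

# Transform a single (possibly unseen) document's BoW vector
new_doc_bow = dictionary.doc2bow(["cat", "sat", "quietly"])  # "quietly" is out of vocabulary and is dropped
print(tfidf_from_dict[new_doc_bow])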
Code
Let's look at an example that uses the gensim.models.TfidfModel() function in the code below:
from gensim import corpora
from gensim.models import TfidfModel

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems."
]

# Tokenize the documents and create a dictionary
text_tokens = [[text for text in doc.lower().split()] for doc in documents]
dictionary = corpora.Dictionary(text_tokens)

# Create a bag-of-words (BoW) representation for each document
corpus = [dictionary.doc2bow(text) for text in text_tokens]

# Create a TfidfModel
tfidf_model = TfidfModel(corpus, normalize=True)

# Transform the BoW representation into Tfidf representation
tfidf_representation = tfidf_model[corpus]

# Print the Tfidf representation for each document
for i, doc in enumerate(tfidf_representation):
    print(f"Document {i + 1}: {doc}")
Code explanation
Line 1–2: Firstly, we import the required modules and classes from Gensim: corpora and TfidfModel.
Line 5–9: Now, we create a list containing three sample text documents.
Line 12: In this line, we tokenize each document by splitting it into lowercase words and store the result in text_tokens.
Line 13: We create a Gensim dictionary from the tokenized documents here.
Line 16: In this line, we convert each tokenized document into a bag-of-words (BoW) representation using the doc2bow() method.
Line 19: Now we create a TfidfModel object using the BoW corpus and set normalize=True to normalize the TF-IDF scores.
Line 22: Here, we transform the BoW representation of each document into its corresponding TF-IDF representation using the TfidfModel.
Line 25–26: Finally, we loop over the TF-IDF representation of each document and print the document number (i + 1) along with its TF-IDF representation.
The TF-IDF scores indicate the importance of each word within its respective document. Words with higher scores are more important and distinctive within that specific document. Normalization (set to True in this example) scales each document's TF-IDF vector to unit length, making the scores comparable across different documents.
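To see what "unit length" means in practice, here is a small self-contained sketch (with hypothetical toy data, separate from the example above) that checks the L2 norm of each transformed document:

import math
from gensim import corpora
from gensim.models import TfidfModel

# With normalize=True, each document's TF-IDF vector should have L2 norm ~1
texts = [["red", "apple"], ["green", "apple"], ["red", "green", "blue"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

model = TfidfModel(bow, normalize=True)
for vec in model[bow]:
    norm = math.sqrt(sum(weight ** 2 for _, weight in vec))
    print(round(norm, 6))  # prints 1.0 for each (non-empty) document vector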
Output
Upon execution, each document in the code is represented by a list of tuples. Each tuple contains the word ID and its corresponding TF-IDF score.
Note: With normalization, each TF-IDF score is a real number between 0 and 1. Higher values indicate that the word is more important within the document.
The output looks something like this:
Document 1: [(0, 0.4299876883131281), (1, 0.4299876883131281), (2, 0.4299876883131281), (3, 0.15869566208696556), (4, 0.15869566208696556), (5, 0.15869566208696556), (6, 0.4299876883131281), (7, 0.4299876883131281)]
Document 2: [(3, 0.14736395618799455), (8, 0.3992843032295923), (9, 0.14736395618799455), (10, 0.3992843032295923), (11, 0.3992843032295923), (12, 0.3992843032295923), (13, 0.3992843032295923), (14, 0.3992843032295923)]
Document 3: [(4, 0.1355937998721238), (5, 0.1355937998721238), (9, 0.1355937998721238), (15, 0.3673929317907688), (16, 0.3673929317907688), (17, 0.3673929317907688), (18, 0.3673929317907688), (19, 0.3673929317907688), (20, 0.3673929317907688), (21, 0.3673929317907688)]
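The raw word IDs are hard to read on their own. As a quick follow-up (not part of the original example), you can look each ID up in the dictionary to print (word, score) pairs instead; this sketch rebuilds the same three sample documents so it runs on its own:

from gensim import corpora
from gensim.models import TfidfModel

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems."
]
text_tokens = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(text_tokens)
corpus = [dictionary.doc2bow(text) for text in text_tokens]
tfidf_model = TfidfModel(corpus, normalize=True)

# Replace each word ID with its token from the dictionary
for i, doc in enumerate(tfidf_model[corpus]):
    readable = [(dictionary[word_id], round(score, 3)) for word_id, score in doc]
    print(f"Document {i + 1}: {readable}")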
Advantages
There are several reasons why the TfidfModel is worth using. Some of them are mentioned below:
Feature vector representation: TfidfModel transforms documents into feature vectors, enabling similarity analysis and other machine learning tasks (a small similarity example is sketched after this list).
Importance weighting: The model assigns higher weights to words that are distinctive for a document and down-weights common words that appear in many documents.
Dimensionality reduction: Because words that appear in every document receive zero weight and sparse entries are dropped, TfidfModel reduces the effective dimensionality of the data, which is especially useful in large text datasets.
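As a sketch of the feature-vector advantage, the snippet below uses Gensim's similarities.MatrixSimilarity to run a simple cosine-similarity search over the TF-IDF vectors. The documents and query are made up; MatrixSimilarity and the query call follow the Gensim API, but treat this as an illustration rather than a complete retrieval pipeline.

from gensim import corpora, similarities
from gensim.models import TfidfModel

# Toy corpus (same sentences as the example, lowercased and tokenized)
texts = [
    "machine learning is a subset of artificial intelligence".split(),
    "natural language processing is used in various applications".split(),
    "machine learning and nlp are essential in modern ai systems".split(),
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = TfidfModel(bow_corpus, normalize=True)

# Build an in-memory cosine-similarity index over the TF-IDF vectors
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Query: which document is most similar to this new sentence?
query_bow = dictionary.doc2bow("applications of machine learning".split())
print(list(index[tfidf[query_bow]]))  # one cosine similarity per document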
Conclusion
In conclusion, the TfidfModel in Gensim allows NLP developers to represent and compare documents based on how important each word is within a document relative to the entire corpus, providing valuable insights for various natural language processing tasks.