What is the gensim.models.TfidfModel() function?
Gensim is an open-source Python library widely used in natural language processing (NLP) tasks like topic modeling and document similarity analysis. The TfidfModel is a fundamental component of Gensim; its name stands for Term Frequency-Inverse Document Frequency (TF-IDF).
The gensim.models.TfidfModel() function
The TfidfModel transforms a bag-of-words (BoW) representation of a document into a more meaningful and informative numerical representation.
It assigns a weight to each word in a document based on its frequency, indicating its importance within a specific document (Term Frequency) and its overall significance across the entire corpus of documents (Inverse Document Frequency).
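To make the weighting idea concrete, here is a minimal sketch that computes TF-IDF by hand for toy data, using the common textbook formulation tf(t, d) * log2(N / df(t)). It is an illustration of the concept, not a reproduction of TfidfModel's exact output, since Gensim's weighting functions are configurable (its default also uses a base-2 logarithm).

import math

# Toy documents, already tokenized (made up for illustration)
docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "powerful"],
    ["machine", "translation", "is", "hard"],
]

N = len(docs)  # total number of documents in the corpus

def doc_freq(term):
    # Number of documents that contain the term at least once
    return sum(1 for d in docs if term in d)

def tfidf(term, doc):
    tf = doc.count(term)                 # term frequency in this document
    idf = math.log2(N / doc_freq(term))  # inverse document frequency
    return tf * idf

print(tfidf("machine", docs[0]))  # appears in 2 of 3 documents -> modest weight
print(tfidf("is", docs[0]))       # appears in every document -> weight 0

Words that occur in every document get an IDF of zero, which is exactly why TF-IDF treats them as uninformative.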
Syntax
The syntax to create a TfidfModel is given below:
tfidf_model = TfidfModel(corpus, normalize=True)
corpus is a required parameter: the bag-of-words (BoW) representation of the documents.
normalize is an optional parameter; set it to True to normalize the TF-IDF scores.
Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).
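The constructor also supports a few other convenient usage patterns. The sketch below, based on the Gensim API (double-check the documentation for your installed version), uses made-up toy data to show fitting the model directly from a Dictionary and transforming a single new document:

from gensim import corpora
from gensim.models import TfidfModel

# Hypothetical toy data for illustration
texts = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Fit from a BoW corpus, as in the syntax above...
tfidf_from_corpus = TfidfModel(bow_corpus, normalize=True)

# ...or fit directly from the dictionary's document-frequency statistics
tfidf_from_dict = TfidfModel(dictionary=dictionary, normalize=True)

# Transform a single (possibly unseen) document's BoW vector
new_doc_bow = dictionary.doc2bow(["cat", "sat", "quietly"])  # "quietly" is out of vocabulary and is dropped
print(tfidf_from_dict[new_doc_bow])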
Code
Let's look at an example that uses the gensim.models.TfidfModel() function in the code below:
from gensim import corpora
from gensim.models import TfidfModel

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems."
]

# Tokenize the documents and create a dictionary
text_tokens = [[text for text in doc.lower().split()] for doc in documents]
dictionary = corpora.Dictionary(text_tokens)

# Create a bag-of-words (BoW) representation for each document
corpus = [dictionary.doc2bow(text) for text in text_tokens]

# Create a TfidfModel
tfidf_model = TfidfModel(corpus, normalize=True)

# Transform the BoW representation into Tfidf representation
tfidf_representation = tfidf_model[corpus]

# Print the Tfidf representation for each document
for i, doc in enumerate(tfidf_representation):
    print(f"Document {i + 1}: {doc}")
Code explanation
Line 1–2: Firstly, we import the required modules and classes from Gensim: corpora and TfidfModel.
Line 5–9: Now, we create a list containing three sample text documents.
Line 12: In this line, we tokenize each document by splitting it into lowercase words and store the result in text_tokens.
Line 13: We create a Gensim dictionary from the tokenized documents here.
Line 16: In this line, we convert each tokenized document into a bag-of-words (BoW) representation using the doc2bow() method.
Line 19: Now we create a TfidfModel object using the BoW corpus and set normalize=True to normalize the TF-IDF scores.
Line 22: Here, we transform the BoW representation of each document into its corresponding TF-IDF representation using the TfidfModel.
Line 25–26: Finally, we loop over the TF-IDF representation of each document and print the document number (i + 1) along with its TF-IDF representation.
The TF-IDF scores indicate the importance of each word within its respective document. Words with higher scores are more important and distinctive within that specific document. Normalization (set to True in this example) scales each document's TF-IDF vector to unit length, making the scores comparable across different documents.
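To see what "unit length" means in practice, here is a small self-contained sketch (with hypothetical toy data, separate from the example above) that checks the L2 norm of each transformed document:

import math
from gensim import corpora
from gensim.models import TfidfModel

# With normalize=True, each document's TF-IDF vector should have L2 norm ~1
texts = [["red", "apple"], ["green", "apple"], ["red", "green", "blue"]]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

model = TfidfModel(bow, normalize=True)
for vec in model[bow]:
    norm = math.sqrt(sum(weight ** 2 for _, weight in vec))
    print(round(norm, 6))  # prints 1.0 for each (non-empty) document vector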
Output
Upon execution, each document in the code is represented by a list of tuples. Each tuple contains the word ID and its corresponding TF-IDF score.
Note: With normalization, each TF-IDF score is a real number between 0 and 1. Higher values indicate that the word is more important within the document.
The output looks something like this:
Document 1: [(0, 0.4299876883131281), (1, 0.4299876883131281), (2, 0.4299876883131281), (3, 0.15869566208696556), (4, 0.15869566208696556), (5, 0.15869566208696556), (6, 0.4299876883131281), (7, 0.4299876883131281)]
Document 2: [(3, 0.14736395618799455), (8, 0.3992843032295923), (9, 0.14736395618799455), (10, 0.3992843032295923), (11, 0.3992843032295923), (12, 0.3992843032295923), (13, 0.3992843032295923), (14, 0.3992843032295923)]
Document 3: [(4, 0.1355937998721238), (5, 0.1355937998721238), (9, 0.1355937998721238), (15, 0.3673929317907688), (16, 0.3673929317907688), (17, 0.3673929317907688), (18, 0.3673929317907688), (19, 0.3673929317907688), (20, 0.3673929317907688), (21, 0.3673929317907688)]
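The raw word IDs are hard to read on their own. As a quick follow-up (not part of the original example), you can look each ID up in the dictionary to print (word, score) pairs instead; this sketch rebuilds the same three sample documents so it runs on its own:

from gensim import corpora
from gensim.models import TfidfModel

documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Natural language processing is used in various applications.",
    "Machine learning and NLP are essential in modern AI systems."
]
text_tokens = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(text_tokens)
corpus = [dictionary.doc2bow(text) for text in text_tokens]
tfidf_model = TfidfModel(corpus, normalize=True)

# Replace each word ID with its token from the dictionary
for i, doc in enumerate(tfidf_model[corpus]):
    readable = [(dictionary[word_id], round(score, 3)) for word_id, score in doc]
    print(f"Document {i + 1}: {readable}")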
Advantages
There are several reasons why the TfidfModel is worth using. Some of them are mentioned below:
Feature vector representation: TfidfModel transforms documents into feature vectors, enabling similarity analysis and other machine learning tasks (a small similarity example is sketched after this list).
Importance weighting: The model assigns higher weights to words that are distinctive for a document and down-weights common words that appear in many documents.
Dimensionality reduction: Because words that appear in every document receive zero weight and sparse entries are dropped, TfidfModel reduces the effective dimensionality of the data, which is especially useful in large text datasets.
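As a sketch of the feature-vector advantage, the snippet below uses Gensim's similarities.MatrixSimilarity to run a simple cosine-similarity search over the TF-IDF vectors. The documents and query are made up; MatrixSimilarity and the query call follow the Gensim API, but treat this as an illustration rather than a complete retrieval pipeline.

from gensim import corpora, similarities
from gensim.models import TfidfModel

# Toy corpus (same sentences as the example, lowercased and tokenized)
texts = [
    "machine learning is a subset of artificial intelligence".split(),
    "natural language processing is used in various applications".split(),
    "machine learning and nlp are essential in modern ai systems".split(),
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = TfidfModel(bow_corpus, normalize=True)

# Build an in-memory cosine-similarity index over the TF-IDF vectors
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Query: which document is most similar to this new sentence?
query_bow = dictionary.doc2bow("applications of machine learning".split())
print(list(index[tfidf[query_bow]]))  # one cosine similarity per document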
Conclusion
In conclusion, the TfidfModel in Gensim allows NLP developers to represent and compare documents based on how important each word is within a document relative to the entire corpus, providing valuable insights for various natural language processing tasks.