Text-based recommendations are an integral part of information retrieval systems, helping users discover relevant content based on their queries. One powerful technique for this is Term Frequency-Inverse Document Frequency (TF-IDF).
TF-IDF measures the relevance of a word in a document with reference to a collection of documents, i.e., the corpus. The TF-IDF of a term grows with the frequency of the word in a document, but it is offset by how often the same word occurs across the entire corpus.
TF(t,d) finds out the importance of a term t in any given document d and, in its basic form, is calculated as follows:

TF(t,d) = (number of times t appears in d) / (total number of terms in d)
IDF(t,D) finds out how common the term t is across all of the documents of the corpus D, thus filtering out common terms and stop words. IDF(t,D) is calculated using the following formula:

IDF(t,D) = log(N / number of documents in D that contain t), where N is the total number of documents in D
TF-IDF is the product of the TF and IDF scores:

TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)
A greater TF-IDF score means that a document d is more relevant to the term t. For a term that appears in most documents of the corpus, the IDF score shrinks toward zero, which in turn pushes the TF-IDF score close to 0.
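To make the formulas concrete, here is a minimal pure-Python sketch that computes TF, IDF, and TF-IDF from the definitions above. (The scikit-learn code later in this lesson uses a smoothed, normalized variant, so its numbers differ slightly.)

```python
import math

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log of (corpus size / documents containing the term)
    containing = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# The preprocessed example corpus from this lesson
corpus = [
    "tf-idf technique used information retrieval",
    "determines significance",
    "tf-idf considers term frequency document frequency",
]

# "tf-idf" appears in 2 of 3 documents, so its score is lower than
# that of the rarer term "information"
print(tf_idf("tf-idf", corpus[0], corpus))
print(tf_idf("information", corpus[0], corpus))
```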
TF-IDF creates recommendations through the following steps:
Preprocess the query.
Create TF-IDF vectors.
Compute similarity scores.
Rank documents.
Generate recommendations.
The preprocessing step involves converting all text into lowercase and removing punctuation as well as any stop words. This ensures the query text is clean and in a standard format.
The same preprocessing steps are also carried out on the following sample document corpus to ensure that all documents are in the same standard format.
| Documents | Original Text | After Preprocessing |
|---|---|---|
| Document 1 (D1) | TF-IDF is a technique used in information retrieval. | tf-idf technique used information retrieval |
| Document 2 (D2) | It determines the significance. | determines significance |
| Document 3 (D3) | TF-IDF considers both term frequency and document frequency. | tf-idf considers term frequency document frequency |
| Query | TF-IDF importance in document retrieval | tf-idf importance document retrieval |
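A minimal sketch of this preprocessing step, assuming a small hand-picked stop-word list for this example (libraries such as NLTK or scikit-learn ship complete lists):

```python
import re

# Hand-picked stop words for this example only
STOP_WORDS = {"is", "a", "in", "it", "the", "both", "and"}

def preprocess(text):
    # Lowercase, split on commas/spaces/periods (keeping hyphenated terms
    # like "tf-idf" intact), and drop stop words
    tokens = [t for t in re.split(r"[, .]+", text.lower()) if t]
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(preprocess("TF-IDF is a technique used in information retrieval."))
# tf-idf technique used information retrieval
```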
The second step is to convert the preprocessed query into a TF-IDF vector. This vector represents the query in the same numerical space as the TF-IDF vectors of the documents.
As an example, here is how TF-IDF(considers, D3) is calculated. The values in the matrix below come from scikit-learn's TfidfVectorizer, which uses the raw count for TF, a smoothed IDF, IDF(t) = ln((1 + N) / (1 + df(t))) + 1, and then L2-normalizes each vector:

TF(considers, D3) = 1 (the term appears once in D3)
IDF(considers) = ln((1 + 3) / (1 + 1)) + 1 = ln 2 + 1 ≈ 1.693
Raw TF-IDF(considers, D3) = 1 × 1.693 ≈ 1.693
Normalized TF-IDF(considers, D3) = 1.693 / 4.661 ≈ 0.36, where 4.661 is the Euclidean norm of D3's raw TF-IDF vector

Similarly, TF-IDF is calculated for each term in every document. Below are the TF-IDF vectors of the documents and the query, represented in matrix form:
| | considers | determines | document | frequency | information | retrieval | significance | technique | term | tf-idf | used |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 0 | 0 | 0.47 | 0.47 | 0 | 0.47 | 0 | 0.36 | 0.47 |
| D2 | 0 | 0.71 | 0 | 0 | 0 | 0 | 0.71 | 0 | 0 | 0 | 0 |
| D3 | 0.36 | 0 | 0.36 | 0.73 | 0 | 0 | 0 | 0 | 0.36 | 0.28 | 0 |
| Query | 0 | 0 | 0.62 | 0 | 0 | 0.62 | 0 | 0 | 0 | 0.47 | 0 |
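The matrix above can be reproduced with scikit-learn's TfidfVectorizer, which applies a smoothed IDF and L2 normalization by default. The token_pattern below is an assumption made so that space-separated tokens, including hyphenated ones like tf-idf, stay whole:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The already-preprocessed documents from this lesson
docs = [
    "tf-idf technique used information retrieval",
    "determines significance",
    "tf-idf considers term frequency document frequency",
]

# Treat any run of non-space characters as a single term
vec = TfidfVectorizer(token_pattern=r"[^ ]+")
matrix = vec.fit_transform(docs).toarray()
vocab = vec.vocabulary_

print(round(float(matrix[2, vocab["considers"]]), 2))    # 0.36
print(round(float(matrix[0, vocab["information"]]), 2))  # 0.47
```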
In this step, the similarity scores between the TF-IDF vector of the query and the TF-IDF vectors of the documents are calculated. Cosine similarity is one of the metrics used for this purpose. The higher the cosine similarity, the more similar the documents are to the query.
In this case, the cosine similarity of the query with the documents is as follows:
| | Cosine Similarity |
|---|---|
| D1 | 0.4594 |
| D2 | 0.0000 |
| D3 | 0.3571 |
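A minimal sketch of cosine similarity itself, the dot product of two vectors divided by the product of their magnitudes; the 3-dimensional vectors here are hypothetical, just to illustrate the computation:

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector magnitudes
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

q = np.array([1.0, 0.0, 1.0])  # hypothetical query vector
d = np.array([1.0, 1.0, 0.0])  # hypothetical document vector
print(round(cosine(q, d), 4))  # 0.5
```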
After calculating the similarity, documents are sorted based on their similarity scores in descending order. This ranking identifies the most relevant documents to the query.
For this example, D1 has the highest ranking, followed by D3 and D2.
The last step is to present the top N documents with the highest similarity scores as recommendations to the user.
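The ranking and top-N steps can be sketched with NumPy's argsort; the scores below are the approximate query–document similarities from this example:

```python
import numpy as np

# Approximate cosine similarities of the query with D1, D2, and D3
scores = np.array([0.4594, 0.0, 0.3571])
top_n = 2

# argsort gives ascending order; reverse it for descending similarity
ranked = np.argsort(scores)[::-1]
for idx in ranked[:top_n]:
    print(f"Document {idx + 1}: similarity {scores[idx]:.4f}")
```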
Here's a code example of how to perform text-based recommendations using TF-IDF:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import pandas as pd

# Example documents or corpus
documents = [
    "TF-IDF is a technique used in information retrieval.",
    "It determines the significance.",
    "TF-IDF considers both term frequency and document frequency."]

# Custom tokenizer that tokenizes on commas, spaces, and periods
def custom_tokenizer(text):
    pattern = r'[, .]+'
    tokens = [token for token in re.split(pattern, text) if token]
    return tokens
print(custom_tokenizer(documents[0]))
# Example query
query = "TF-IDF importance in document retrieval"

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# Get the words as headers
headers = tfidf_vectorizer.get_feature_names_out()

# Convert vectors into a matrix
df = pd.DataFrame(tfidf_matrix.toarray(), columns=headers)
print("The TF-IDF Matrix of Documents/Corpus\n", df)

# Transform the query into a TF-IDF vector
query_tfidf = tfidf_vectorizer.transform([query])
print("\nTF-IDF Vector of Query\n", query_tfidf)

# Calculate cosine similarity between the query and documents
cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)

# Get indices of documents sorted by descending similarity
recommended_indices = cosine_similarities.argsort()[0][::-1]

# Print recommended documents
print("\nHere are the recommended documents:")
for idx in recommended_indices:
    print(f"Recommended: Document {idx + 1} - Similarity Score: {cosine_similarities[0][idx]:.4f}")
```
In the above code:
Lines 1–4: The necessary libraries are imported.
Lines 6–10: A document corpus for this example has been defined here.
Lines 12–16: The custom_tokenizer() function takes the text and tokenizes it on commas, periods, and spaces. This ensures that hyphenated terms, like tf-idf, do not split in two.
Line 19: An example query is defined. Our code will provide text-based document recommendations for this query.
Line 22: The TF-IDF vectorizer object is initialized; it removes stop words and uses the custom_tokenizer() function for tokenization.
Line 23: The document corpus is passed to the TF-IDF vectorizer object to convert the documents into TF-IDF vectors.
Line 25: The words/terms of TF-IDF vectors are stored in a variable.
Lines 28–29: TF-IDF vectors are converted into a DataFrame for visualization purposes.
Lines 32–33: The TF-IDF vector of the query is created via the same TF-IDF vectorizer object defined earlier.
Line 36: The cosine similarity of the query is being calculated with each of the corpus documents.
Line 39: The indexes of the recommended documents are stored in the recommended_indices variable.
Lines 42–44: Recommended documents are printed in order of their similarity to the query, along with the similarity score.