How to perform text-based recommendations using TF-IDF

Text-based recommendations are an integral part of information retrieval systems, helping users discover relevant content based on their queries. One powerful technique for this is Term Frequency-Inverse Document Frequency (TF-IDF).

TF-IDF measures the relevance of a word in a document relative to a collection of documents, i.e., the corpus. The TF-IDF score of a term grows with the frequency of that term in a document, but it is offset by how often the same term appears across the entire corpus.

TF(t,d) measures the importance of a term t in a given document d and is calculated as follows:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

IDF(t,D) measures how common the term t is across all of the documents of the corpus D, thus down-weighting common terms and stop words. IDF(t,D) is calculated using the following formula:

IDF(t, D) = log(N / n_t)

where N is the total number of documents in the corpus D and n_t is the number of documents that contain t.

TF-IDF is the product of the TF and IDF scores:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

A greater TF-IDF score means that document d is more relevant to the term t. For a term that appears in most documents of the corpus, the IDF score shrinks toward 0, in turn pushing the TF-IDF score close to 0.
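As a quick worked example with the formulas above (the numbers here are purely illustrative), suppose a corpus D of N = 10 documents and a 100-term document d in which the term retrieval appears 5 times while occurring in only 2 of the 10 documents. Using base-10 logarithms:

TF(retrieval, d) = 5 / 100 = 0.05

IDF(retrieval, D) = log(10 / 2) ≈ 0.70

TF-IDF(retrieval, d, D) = 0.05 × 0.70 = 0.035

By contrast, a term that appears in all 10 documents gets IDF = log(10 / 10) = 0, so its TF-IDF score is 0 no matter how often it occurs within d.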

Recommendations using TF-IDF

TF-IDF creates recommendations through the following steps:

  1. Preprocess the query.

  2. Create TF-IDF vectors.

  3. Compute similarity scores.

  4. Rank documents.

  5. Generate recommendations.

1. Preprocess the query

The preprocessing step involves converting all text to lowercase and removing punctuation as well as stop words. This ensures the query text is clean and in a standard format.

The same preprocessing steps are also carried out on the following sample document corpus to ensure that all documents are in the same standard format.

| Documents | Original Text | After Preprocessing |
|---|---|---|
| Document 1 (D1) | TF-IDF is a technique used in information retrieval. | tf-idf technique used information retrieval |
| Document 2 (D2) | It determines the significance. | determines significance |
| Document 3 (D3) | TF-IDF considers both term frequency and document frequency. | tf-idf considers term frequency document frequency |
| Query | TF-IDF importance in document retrieval | tf-idf importance document retrieval |
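The code example at the end of this answer lets scikit-learn's TfidfVectorizer handle lowercasing and stop-word removal. For illustration, here is a minimal, standalone sketch of the same preprocessing, using a hypothetical preprocess() helper that splits on commas, spaces, and periods (so hyphenated terms like tf-idf stay intact) and scikit-learn's built-in English stop-word list:

import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text):
    # Lowercase, then split on commas, spaces, and periods so that
    # hyphenated terms such as "tf-idf" stay in one piece
    tokens = [t for t in re.split(r'[, .]+', text.lower()) if t]
    # Drop common English stop words
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(preprocess("TF-IDF is a technique used in information retrieval."))
# tf-idf technique used information retrieval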

2. Create TF-IDF vectors

The second step is to convert the preprocessed query into a TF-IDF vector. This vector represents the query in the same numerical space as the TF-IDF vectors of the documents.

This is how TF-IDF(considers,D3) is calculated:
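With N = 3 documents and the term considers appearing once in D3 and in no other document, and assuming scikit-learn's TfidfVectorizer defaults, which use a smoothed variant of the IDF formula above, IDF(t) = ln((1 + N) / (1 + n_t)) + 1, followed by L2 normalization of each document vector (the same settings the code example at the end uses):

TF(considers, D3) = 1 (raw count of considers in D3)

IDF(considers, D) = ln((1 + 3) / (1 + 1)) + 1 = ln(2) + 1 ≈ 1.69

TF-IDF(considers, D3) = 1 × 1.69 ≈ 1.69

Dividing by the Euclidean length of D3's unnormalized TF-IDF vector (≈ 4.66) gives 1.69 / 4.66 ≈ 0.36, the value shown for considers in the D3 row below.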

Similarly, TF-IDF will be calculated for each term in every document. Below are the TF-IDF vectors of documents and the query represented in matrix form:


| | considers | determines | document | frequency | information | retrieval | significance | technique | term | tf-idf | used |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 0 | 0 | 0.47 | 0.47 | 0 | 0.47 | 0 | 0.36 | 0.47 |
| D2 | 0 | 0.707 | 0 | 0 | 0 | 0 | 0.71 | 0 | 0 | 0 | 0 |
| D3 | 0.36 | 0 | 0.363 | 0.72 | 0 | 0 | 0 | 0 | 0.36 | 0.28 | 0 |
| Query | 0 | 0 | 0.623 | 0 | 0 | 0.623 | 0 | 0 | 0 | 0.473 | 0 |

3. Compute similarity scores

In this step, the similarity scores between the TF-IDF vector of the query and the TF-IDF vectors of the documents are calculated. Cosine similarity is one of the metrics used for this purpose. The higher the cosine similarity, the more similar the documents are to the query.
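Cosine similarity is the dot product of two vectors divided by the product of their Euclidean lengths. As a minimal sketch, using a hypothetical cosine_sim() helper and the rounded query and D1 vectors from the matrix above:

import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rounded TF-IDF vectors of the query and D1, taken from the matrix above
query = np.array([0, 0, 0.623, 0, 0, 0.623, 0, 0, 0, 0.473, 0])
d1 = np.array([0, 0, 0, 0, 0.47, 0.47, 0, 0.47, 0, 0.36, 0.47])

print(round(cosine_sim(query, d1), 2))  # ~0.46, matching D1's score below up to rounding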

In this case, the cosine similarity of the query with the documents is as follows:


| Document | Cosine Similarity |
|---|---|
| D1 | 0.4594 |
| D2 | 0 |
| D3 | 0.36 |

4. Rank documents

After calculating the similarity, documents are sorted based on their similarity scores in descending order. This ranking identifies the most relevant documents to the query.

For this example, D1 has the highest ranking, followed by D3 and D2.
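A minimal sketch of this ranking step, assuming NumPy and taking the similarity scores from the table above:

import numpy as np

# Cosine similarities of D1, D2, and D3 with the query (from the table above)
scores = np.array([0.4594, 0.0, 0.36])
# Document indices sorted by descending similarity
ranking = np.argsort(scores)[::-1]
print([f"D{i + 1}" for i in ranking])  # ['D1', 'D3', 'D2']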

5. Generate recommendations

The last step is to present the top N documents with the highest similarity scores as recommendations to the user.
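Continuing the same example, a minimal sketch of returning the top N results (with a hypothetical cutoff of N = 2):

import numpy as np

documents = ["Document 1", "Document 2", "Document 3"]
scores = np.array([0.4594, 0.0, 0.36])  # cosine similarities from the table above
top_n = 2  # hypothetical cutoff

for i in np.argsort(scores)[::-1][:top_n]:
    print(f"Recommended: {documents[i]} - Similarity Score: {scores[i]:.4f}")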

Code example

Here's a complete code example of how to perform text-based recommendations using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import pandas as pd
# Example documents or corpus
documents = [
    "TF-IDF is a technique used in information retrieval.",
    "It determines the significance.",
    "TF-IDF considers both term frequency and document frequency.",
]
# Custom tokenizer that tokenizes on commas, spaces, and periods
def custom_tokenizer(text):
    pattern = r'[, .]+'
    tokens = [token for token in re.split(pattern, text) if token]
    return tokens
print(custom_tokenizer(documents[0]))

# Example Query
query = "TF-IDF importance in document retrieval"

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, token_pattern=None, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# Get the words as headers
headers = tfidf_vectorizer.get_feature_names_out()

# Convert the vectors into a DataFrame for display
df = pd.DataFrame(tfidf_matrix.toarray(), columns=headers)
print("The TF-IDF Matrix of Documents/Corpus \n", df)

# Transform the query into a TF-IDF vector
query_tfidf = tfidf_vectorizer.transform([query])
print("\nTF-IDF Vector of Query\n", query_tfidf)

# Calculate cosine similarity between the query and documents
cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)

# Get indices of recommended documents
recommended_indices = cosine_similarities.argsort()[0][::-1]

# Print recommended documents
print("\nHere are the recommended documents:")
for idx in recommended_indices:
    print(f"Recommended: Document {idx + 1} - Similarity Score: {cosine_similarities[0][idx]:.4f}")

Code explanation

In the above code:

  • Lines 1–4: The necessary libraries are imported.

  • Lines 6–10: The document corpus for this example is defined.

  • Lines 12–16: The custom_tokenizer() function takes the text and tokenizes it on commas, periods, and spaces. This ensures that hyphenated terms such as tf-idf are not split in two.

  • Line 19: An example query is defined. Our code will provide text-based document recommendations for this query.

  • Line 22: A TfidfVectorizer object is initialized; it removes English stop words and uses the custom_tokenizer() function for tokenization.

  • Line 23: The document corpus is passed to the TF-IDF vectorizer object to convert the documents into TF-IDF vectors.

  • Line 25: The words/terms (feature names) of the TF-IDF vectors are stored in the headers variable.

  • Lines 28–29: TF-IDF vectors are converted into a DataFrame for visualization purposes.

  • Lines 32–33: The TF-IDF vector of the query is created via the same TF-IDF vectorizer object defined earlier.

  • Line 36: The cosine similarity of the query with each of the corpus documents is calculated.

  • Line 39: The document indices, sorted in descending order of similarity, are stored in the recommended_indices variable.

  • Lines 42–44: The recommended documents are printed in order of their similarity to the query, along with their similarity scores.
