How to perform text-based recommendations using TF-IDF

Text-based recommendations are an integral part of information retrieval systems, helping users discover relevant content based on their queries. One powerful technique for this is Term Frequency-Inverse Document Frequency (TF-IDF).

TF-IDF measures the relevance of a word in a document relative to a collection of documents, i.e., the corpus. The TF-IDF score of a term grows with the frequency of that term in a document, but it is offset by how often the same term appears across the entire corpus.

TF(t,d) measures the importance of a term t in a given document d and is calculated as follows:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

IDF(t,D) measures how common the term t is across all of the documents of the corpus D, thus down-weighting common terms and stop words. IDF(t,D) is calculated using the following formula:

IDF(t, D) = log(N / n_t)

where N is the total number of documents in the corpus D and n_t is the number of documents that contain t.

TF-IDF is the product of the TF and IDF scores:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

A greater TF-IDF score means that document d is more relevant to the term t. For a term that appears in most documents of the corpus, the IDF score shrinks toward 0, in turn pushing the TF-IDF score close to 0.
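As a quick worked example with the formulas above (the numbers here are purely illustrative), suppose a corpus D of N = 10 documents and a 100-term document d in which the term retrieval appears 5 times while occurring in only 2 of the 10 documents. Using base-10 logarithms:

TF(retrieval, d) = 5 / 100 = 0.05

IDF(retrieval, D) = log(10 / 2) ≈ 0.70

TF-IDF(retrieval, d, D) = 0.05 × 0.70 = 0.035

By contrast, a term that appears in all 10 documents gets IDF = log(10 / 10) = 0, so its TF-IDF score is 0 no matter how often it occurs within d.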

Recommendations using TF-IDF

TF-IDF creates recommendations through the following steps:

  1. Preprocess the query.

  2. Create TF-IDF vectors.

  3. Compute similarity scores.

  4. Rank documents.

  5. Generate recommendations.

1. Preprocess the query

The preprocessing step involves converting all text to lowercase and removing punctuation as well as stop words. This ensures the query text is clean and in a standard format.

The same preprocessing steps are also carried out on the following sample document corpus to ensure that all documents are in the same standard format.

| Documents | Original Text | After Preprocessing |
|---|---|---|
| Document 1 (D1) | TF-IDF is a technique used in information retrieval. | tf-idf technique used information retrieval |
| Document 2 (D2) | It determines the significance. | determines significance |
| Document 3 (D3) | TF-IDF considers both term frequency and document frequency. | tf-idf considers term frequency document frequency |
| Query | TF-IDF importance in document retrieval | tf-idf importance document retrieval |
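The code example at the end of this answer lets scikit-learn's TfidfVectorizer handle lowercasing and stop-word removal. For illustration, here is a minimal, standalone sketch of the same preprocessing, using a hypothetical preprocess() helper that splits on commas, spaces, and periods (so hyphenated terms like tf-idf stay intact) and scikit-learn's built-in English stop-word list:

import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text):
    # Lowercase, then split on commas, spaces, and periods so that
    # hyphenated terms such as "tf-idf" stay in one piece
    tokens = [t for t in re.split(r'[, .]+', text.lower()) if t]
    # Drop common English stop words
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

print(preprocess("TF-IDF is a technique used in information retrieval."))
# tf-idf technique used information retrieval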

2. Create TF-IDF vectors

The second step is to convert the preprocessed query into a TF-IDF vector. This vector represents the query in the same numerical space as the TF-IDF vectors of the documents.

This is how TF-IDF(considers,D3) is calculated:
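With N = 3 documents and the term considers appearing once in D3 and in no other document, and assuming scikit-learn's TfidfVectorizer defaults, which use a smoothed variant of the IDF formula above, IDF(t) = ln((1 + N) / (1 + n_t)) + 1, followed by L2 normalization of each document vector (the same settings the code example at the end uses):

TF(considers, D3) = 1 (raw count of considers in D3)

IDF(considers, D) = ln((1 + 3) / (1 + 1)) + 1 = ln(2) + 1 ≈ 1.69

TF-IDF(considers, D3) = 1 × 1.69 ≈ 1.69

Dividing by the Euclidean length of D3's unnormalized TF-IDF vector (≈ 4.66) gives 1.69 / 4.66 ≈ 0.36, the value shown for considers in the D3 row below.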

Similarly, TF-IDF will be calculated for each term in every document. Below are the TF-IDF vectors of documents and the query represented in matrix form:


| | considers | determines | document | frequency | information | retrieval | significance | technique | term | tf-idf | used |
|---|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 0 | 0 | 0 | 0 | 0.47 | 0.47 | 0 | 0.47 | 0 | 0.36 | 0.47 |
| D2 | 0 | 0.707 | 0 | 0 | 0 | 0 | 0.71 | 0 | 0 | 0 | 0 |
| D3 | 0.36 | 0 | 0.363 | 0.72 | 0 | 0 | 0 | 0 | 0.36 | 0.28 | 0 |
| Query | 0 | 0 | 0.623 | 0 | 0 | 0.623 | 0 | 0 | 0 | 0.473 | 0 |

3. Compute similarity scores

In this step, the similarity scores between the TF-IDF vector of the query and the TF-IDF vectors of the documents are calculated. Cosine similarity is one of the metrics used for this purpose. The higher the cosine similarity, the more similar the documents are to the query.
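Cosine similarity is the dot product of two vectors divided by the product of their Euclidean lengths. As a minimal sketch, using a hypothetical cosine_sim() helper and the rounded query and D1 vectors from the matrix above:

import numpy as np

def cosine_sim(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Rounded TF-IDF vectors of the query and D1, taken from the matrix above
query = np.array([0, 0, 0.623, 0, 0, 0.623, 0, 0, 0, 0.473, 0])
d1 = np.array([0, 0, 0, 0, 0.47, 0.47, 0, 0.47, 0, 0.36, 0.47])

print(round(cosine_sim(query, d1), 2))  # ~0.46, matching D1's score below up to rounding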

In this case, the cosine similarity of the query with the documents is as follows:


| Document | Cosine Similarity |
|---|---|
| D1 | 0.4594 |
| D2 | 0 |
| D3 | 0.36 |

4. Rank documents

After calculating the similarity, documents are sorted based on their similarity scores in descending order. This ranking identifies the most relevant documents to the query.

For this example, D1 has the highest ranking, followed by D3 and D2.
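A minimal sketch of this ranking step, assuming NumPy and taking the similarity scores from the table above:

import numpy as np

# Cosine similarities of D1, D2, and D3 with the query (from the table above)
scores = np.array([0.4594, 0.0, 0.36])
# Document indices sorted by descending similarity
ranking = np.argsort(scores)[::-1]
print([f"D{i + 1}" for i in ranking])  # ['D1', 'D3', 'D2']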

5. Generate recommendations

The last step is to present the top N documents with the highest similarity scores as recommendations to the user.
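Continuing the same example, a minimal sketch of returning the top N results (with a hypothetical cutoff of N = 2):

import numpy as np

documents = ["Document 1", "Document 2", "Document 3"]
scores = np.array([0.4594, 0.0, 0.36])  # cosine similarities from the table above
top_n = 2  # hypothetical cutoff

for i in np.argsort(scores)[::-1][:top_n]:
    print(f"Recommended: {documents[i]} - Similarity Score: {scores[i]:.4f}")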

Code example

Here's a complete code example of how to perform text-based recommendations using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import pandas as pd
# Example documents or corpus
documents = [
    "TF-IDF is a technique used in information retrieval.",
    "It determines the significance.",
    "TF-IDF considers both term frequency and document frequency.",
]
# Custom tokenizer that tokenizes on commas, spaces, and periods
def custom_tokenizer(text):
    pattern = r'[, .]+'
    tokens = [token for token in re.split(pattern, text) if token]
    return tokens
print(custom_tokenizer(documents[0]))

# Example Query
query = "TF-IDF importance in document retrieval"

# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=custom_tokenizer, token_pattern=None, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
# Get the words as headers
headers = tfidf_vectorizer.get_feature_names_out()

# Convert the vectors into a DataFrame for display
df = pd.DataFrame(tfidf_matrix.toarray(), columns=headers)
print("The TF-IDF Matrix of Documents/Corpus \n", df)

# Transform the query into a TF-IDF vector
query_tfidf = tfidf_vectorizer.transform([query])
print("\nTF-IDF Vector of Query\n", query_tfidf)

# Calculate cosine similarity between the query and documents
cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)

# Get indices of recommended documents
recommended_indices = cosine_similarities.argsort()[0][::-1]

# Print recommended documents
print("\nHere are the recommended documents:")
for idx in recommended_indices:
    print(f"Recommended: Document {idx + 1} - Similarity Score: {cosine_similarities[0][idx]:.4f}")

Code explanation

In the above code:

  • Lines 1–4: The necessary libraries are imported.

  • Lines 6–10: The document corpus for this example is defined.

  • Lines 12–16: The custom_tokenizer() function takes the text and tokenizes it on commas, periods, and spaces. This ensures that hyphenated terms such as tf-idf are not split in two.

  • Line 19: An example query is defined. Our code will provide text-based document recommendations for this query.

  • Line 22: A TfidfVectorizer object is initialized; it removes English stop words and uses the custom_tokenizer() function for tokenization.

  • Line 23: The document corpus is passed to the TF-IDF vectorizer object to convert the documents into TF-IDF vectors.

  • Line 25: The words/terms (feature names) of the TF-IDF vectors are stored in the headers variable.

  • Lines 28–29: TF-IDF vectors are converted into a DataFrame for visualization purposes.

  • Lines 32–33: The TF-IDF vector of the query is created via the same TF-IDF vectorizer object defined earlier.

  • Line 36: The cosine similarity of the query with each of the corpus documents is calculated.

  • Line 39: The document indices, sorted in descending order of similarity, are stored in the recommended_indices variable.

  • Lines 42–44: The recommended documents are printed in order of their similarity to the query, along with their similarity scores.
