What is the gensim.similarities.MatrixSimilarity() function?

Gensim is a widely used Python library for natural language processing (NLP). Among its tools is MatrixSimilarity, which measures how similar documents are to one another based on their content.

The `gensim.similarities.MatrixSimilarity()` function

gensim.similarities.MatrixSimilarity() computes the similarity between documents using cosine similarity, a measure of similarity between two non-zero vectors defined in an inner product space. Strictly speaking, MatrixSimilarity is a class: calling it builds an in-memory index that stores the corpus as a matrix of unit-length document vectors. Querying that index with a document returns the document's cosine similarity to every document in the index.
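
As a rough illustration of the measure itself, independent of Gensim, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The function name cosine_similarity and the sample vectors below are illustrative assumptions, not part of Gensim's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word-count vectors over a shared three-word vocabulary.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]   # same direction as doc_a, with doubled counts
doc_c = [0, 0, 3]   # shares no words with doc_a

same_direction = cosine_similarity(doc_a, doc_b)  # approximately 1
no_overlap = cosine_similarity(doc_a, doc_c)      # 0: the vectors are orthogonal
```

Because only the direction of the vectors matters, a document and a longer document with the same word proportions score approximately 1.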

This makes it possible to put a number on how closely two documents resemble each other, which is useful for tasks such as finding related documents or ranking documents against a query.

Syntax

The syntax for using the gensim.similarities.MatrixSimilarity() function is given below:

Code explanation:

Lines 1–2: First, we import the classes we need from Gensim: MatrixSimilarity from gensim.similarities and Dictionary from gensim.corpora.
Line 4: Next, we define texts, a list of two tokenized documents, each represented as a list of words.
Line 5: Here, we create a Dictionary object, dictionary, which maps every unique word in the documents to an integer ID.
Line 6: Now, we convert each document in texts to a bag-of-words representation using dictionary.doc2bow(). Each document becomes a list of (word ID, frequency) tuples. We store the result in the corpus variable.

Lines 7–8: Moving on, we initialize similarity_matrix using MatrixSimilarity(corpus). This builds a similarity index from the given corpus. Then, we query the index with the first document by evaluating similarity_matrix[corpus[0]], which returns its similarity to every document in the index.
Line 9: Finally, we print the similarity scores, one for each document in the corpus.

Output

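Upon execution, the code prints the similarity scores between the first document and every document in the corpus, including itself.

Cosine similarity in general ranges from -1 to 1. Bag-of-words vectors contain only non-negative counts, so here the scores fall between 0 and 1. A value of 1 means the two vectors point in the same direction (the documents use the same words in the same proportions), and a value of 0 means the vectors are orthogonal, that is, the documents share no words. Values in between indicate varying degrees of similarity.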

The output looks like this:

[0.99999994 0.33333334]

The first value, 0.99999994, is the first document compared with itself. The two vectors are identical, so the true score is 1; the printed value falls just short because Gensim stores the index as 32-bit floats by default. The second value, 0.33333334, is the similarity between the first and second documents: they share some words but are not identical. A score of 1/3 arises, for example, when two documents of three distinct words each share exactly one word, since 1 / (√3 × √3) = 1/3.

Conclusion

Overall, MatrixSimilarity in Gensim is a convenient tool for exploring document similarity. By indexing a corpus once and returning cosine similarity scores for any query document, it lets NLP developers compare documents and measure how closely they are related.