What is the gensim.similarities.MatrixSimilarity() function?
Gensim is a widely used Python library for natural language processing (NLP). It includes MatrixSimilarity, which measures how similar documents are based on their content.
The gensim.similarities.MatrixSimilarity() function
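The gensim.similarities.MatrixSimilarity() function in Gensim is used to calculate the similarity between documents using the concept of cosine similarity. Strictly speaking, it is a class: calling it builds a similarity index that is held in memory.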
This function helps us quantify document similarity, gain insights into text relationships, identify related documents, and improve effectiveness in NLP tasks.
Syntax
The syntax for using the gensim.similarities.MatrixSimilarity() function is given below:
similarity_matrix = MatrixSimilarity(corpus, num_features=num_features)
corpus is a required parameter, representing the corpus of documents as a list of vectors or a sparse matrix.
num_features is an optional parameter representing the dimensionality of the feature space. If not given, it is inferred from the corpus.
Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).
Code
Let's implement the gensim.similarities.MatrixSimilarity() function in the code below:
from gensim.similarities import MatrixSimilarity
from gensim.corpora import Dictionary

texts = [['apple', 'banana', 'orange'], ['orange', 'kiwi', 'grape']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
similarity_matrix = MatrixSimilarity(corpus)
similarity = similarity_matrix[corpus[0]]
print(similarity)
Code explanation:
Lines 1–2: First, we import the necessary classes from Gensim: MatrixSimilarity from gensim.similarities and Dictionary from gensim.corpora.
Line 4: Next, we define a list, texts, containing two sublists of tokens, each representing a different document.
Line 5: Here, we create a Dictionary object, dictionary, to represent the vocabulary of the documents.
Line 6: Now, we convert each document in texts to a bag-of-words representation using dictionary.doc2bow(). This converts each document into a list of tuples containing a word ID and its frequency. We store the result in the corpus variable.
Lines 7–8: Moving on, we initialize similarity_matrix using MatrixSimilarity(corpus). This creates a similarity index based on the given corpus. Then, we compute the similarity of the first document (corpus[0]) by accessing similarity_matrix[corpus[0]].
Line 9: Finally, we print the similarity scores, which indicate the similarity between the first document and each document in the corpus.
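To make the bag-of-words step in line 6 more concrete, here is a minimal plain-Python sketch of the same idea. The helpers build_vocab and to_bow are hypothetical names invented for this illustration; they are not part of Gensim, and Gensim's Dictionary may assign different integer IDs to the same words.

```python
from collections import Counter


def build_vocab(texts):
    """Hypothetical helper: give each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Hypothetical helper: return sorted (token_id, count) pairs for the known tokens."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())


texts = [['apple', 'banana', 'orange'], ['orange', 'kiwi', 'grape']]
vocab = build_vocab(texts)
corpus = [to_bow(tokens, vocab) for tokens in texts]

print(vocab)
print(corpus)
```

Each document becomes a short list of (ID, count) pairs, which is the sparse format that the similarity index takes as input.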
Output
Upon execution, the code prints the similarity scores between the first document and every document in the corpus, including itself.
In the case of cosine similarity, a value of 1 indicates that the two vectors point in the same direction (for example, identical documents), and a value of 0 indicates orthogonal vectors, that is, documents with no words in common. Because bag-of-words counts are never negative, the scores here fall between 0 and 1, with higher values indicating greater similarity.
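The cosine similarity described above can be sketched by hand in plain Python. This is only an illustration of the formula, not Gensim's implementation; the function name cosine_similarity and the word IDs below are assumptions made for the example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as lists of (id, weight) pairs."""
    a = dict(vec_a)
    b = dict(vec_b)
    # Only IDs present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term_id, 0.0) for term_id, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Assumed word IDs: 0=apple, 1=banana, 2=orange, 3=kiwi, 4=grape.
doc_a = [(0, 1), (1, 1), (2, 1)]
doc_b = [(2, 1), (3, 1), (4, 1)]

sim_self = cosine_similarity(doc_a, doc_a)
sim_cross = cosine_similarity(doc_a, doc_b)
print(sim_self, sim_cross)
```

The two documents share one word out of three each, so the cross similarity works out to about one third, while a document compared with itself scores about 1 (up to floating-point rounding).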
The output looks like this:
[0.99999994, 0.3333333]
The first value, 0.99999994, is the similarity of the first document with itself. It is effectively 1, since the two are identical; the small shortfall is floating-point rounding. The second value, 0.3333333, is the similarity score between the first and second documents. The two documents share one word (orange) out of three each, which gives a cosine similarity of 1/3: they overlap but are not identical.
Conclusion
Overall, the MatrixSimilarity function in Gensim is a useful tool for exploring document similarity. By building a similarity index over a corpus, it lets NLP developers compare documents and measure how similar they are.