What is the Jaccard Similarity measure in NLP?
Overview
Document (or text) similarity is an estimate of how similar the given documents are to each other. There are different ways of measuring it, such as cosine similarity and Euclidean distance.
Jaccard Similarity is one of the ways to determine the similarity between the documents.
Jaccard Similarity is defined as the ratio of the size of the intersection of the documents' token sets to the size of their union. In other words, it is the number of distinct tokens common to the documents divided by the total number of distinct tokens across the documents.
Considering tokens as words in the document, Jaccard Similarity is the ratio of the number of distinct words common to the documents to the total number of distinct words in them.
The value of Jaccard Similarity ranges from 0 to 1, where 1 indicates that the documents contain exactly the same set of words, while 0 means the documents have no words in common.
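The two ends of this range can be checked directly with Python sets. The following is a minimal sketch; the helper name `jaccard` and the sample tokens are made up for illustration and are not part of the original answer.

```python
def jaccard(tokens_a, tokens_b):
    # Hypothetical helper: treat each document as a set of tokens.
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Same tokens in a different order: the sets are equal.
identical = jaccard(["a", "b", "c"], ["c", "b", "a"])
# No token in common: the intersection is empty.
disjoint = jaccard(["a", "b"], ["x", "y"])
# Two shared tokens out of four distinct tokens.
partial = jaccard(["a", "b", "c"], ["b", "c", "d"])

print(identical, disjoint, partial)
```

Because the documents are compared as sets, word order and repeated words do not affect the score.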
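The mathematical representation of the similarity between two documents, each treated as a set of tokens, is as follows:
$$J(\mathrm{doc}_1, \mathrm{doc}_2) = \frac{|\mathrm{doc}_1 \cap \mathrm{doc}_2|}{|\mathrm{doc}_1 \cup \mathrm{doc}_2|}$$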
Example
Consider the following two documents:
doc_1 = "educative is the best platform out there."
doc_2 = "educative is a new platform."
Tokenizing the documents above into words (ignoring punctuation), we get the following:
- words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
- words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
The intersection, that is, the words common to both documents, is {'educative', 'is', 'platform'}. There are 3 common words.
The union, that is, all the distinct words in the documents, is {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}. In total, there are 9 distinct words.
Hence, the Jaccard similarity is 3/9 ≈ 0.333.
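The arithmetic above can be reproduced on the punctuated documents. The following is a minimal sketch that assumes a simple regular-expression tokenizer; the `tokenize` helper is hypothetical and only one of several ways to drop punctuation.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of word characters only,
    # which drops punctuation such as the trailing period.
    return set(re.findall(r"\w+", text.lower()))

doc_1 = "educative is the best platform out there."
doc_2 = "educative is a new platform."

words_doc_1 = tokenize(doc_1)
words_doc_2 = tokenize(doc_2)

common_words = words_doc_1 & words_doc_2   # intersection
all_words = words_doc_1 | words_doc_2      # union
similarity = len(common_words) / len(all_words)

print(sorted(common_words))    # ['educative', 'is', 'platform']
print(len(all_words))          # 9
print(round(similarity, 3))    # 0.333
```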
Code
def intersection(doc_1, doc_2):
    return doc_1.intersection(doc_2)

def union(doc_1, doc_2):
    return doc_1.union(doc_2)

def jaccard_similarity(doc_1, doc_2):
    # Lowercase each document and split it on spaces to get its words
    words_doc_1 = doc_1.lower().split(' ')
    words_doc_2 = doc_2.lower().split(' ')
    # Convert the word lists to sets to drop duplicates
    words_doc_1_set = set(words_doc_1)
    words_doc_2_set = set(words_doc_2)
    # Words common to both documents
    intersection_docs = intersection(words_doc_1_set, words_doc_2_set)
    # All distinct words across both documents
    union_docs = union(words_doc_1_set, words_doc_2_set)
    # Jaccard similarity: size of the intersection over size of the union
    return len(intersection_docs) / len(union_docs)

doc_1 = "educative is the best platform out there"
doc_2 = "educative is a new platform"

print("doc_1 - '%s'" % (doc_1, ))
print("doc_2 - '%s'" % (doc_2, ))
print("Jaccard_similarity(doc_1, doc_2) = %s" % (jaccard_similarity(doc_1, doc_2)))
Explanation
- Lines 1-2: The intersection function returns the intersection of two sets of words.
- Lines 4-5: The union function returns the union of two sets of words.
- Lines 9-10: Each document is converted to lowercase and split on the space character to get the words in the document.
- Lines 12-13: Each list of words is converted into a set.
- Line 15: We get the intersection of doc_1 and doc_2 using the intersection function in lines 1-2.
- Line 17: We get the union of doc_1 and doc_2 using the union function in lines 4-5.
- Line 19: The number of words in the intersection divided by the number of words in the union is returned.
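Splitting on a single space character means that consecutive spaces produce empty-string tokens, which can lower the score for documents that differ only in spacing. The following is a possible variant, not the original author's code: it splits on any run of whitespace and guards against an empty union. The function name and the choice to return 0.0 for two empty documents are assumptions made for this sketch.

```python
def jaccard_similarity_whitespace(doc_1, doc_2):
    # split() with no argument splits on any run of whitespace and
    # never yields empty strings, unlike split(' ').
    words_doc_1_set = set(doc_1.lower().split())
    words_doc_2_set = set(doc_2.lower().split())

    union_docs = words_doc_1_set | words_doc_2_set
    if not union_docs:
        # Both documents are empty, so the ratio is undefined.
        # Returning 0.0 is an assumption made for this sketch.
        return 0.0

    intersection_docs = words_doc_1_set & words_doc_2_set
    return len(intersection_docs) / len(union_docs)

same_words = jaccard_similarity_whitespace(
    "educative  is a new   platform", "educative is a new platform"
)
both_empty = jaccard_similarity_whitespace("", "")

print(same_words)
print(both_empty)
```

With this variant, the two differently spaced documents are scored as identical, because only the words themselves end up in the sets.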