What is the gensim.models.phrases.Phrases() function?
Gensim is a Python library for natural language processing (NLP) that offers a range of powerful tools to extract insights from text data.
The gensim.models.phrases.Phrases() function
Although it is called like a function, gensim.models.phrases.Phrases() is the constructor of a class provided by Gensim. It builds a model that detects frequently co-occurring word combinations in a corpus.
Phrases() helps us automatically recognize multi-word expressions that carry meaning as a unit, such as collocations like "machine learning", and encode each one as a single token when extracting information from text corpora.
Syntax
Below is the syntax of the gensim.models.phrases.Phrases() function:
gensim.models.phrases.Phrases(sentences, min_count=5, threshold=10.0, delimiter='_')
sentences is the corpus to train on, given as a list (or any iterable) of tokenized sentences. In practice it is almost always supplied, although it technically defaults to None.
min_count is an optional parameter set to 5 by default. It is the minimum number of times a word or phrase must occur in the corpus to be considered during training.
threshold is an optional parameter set to 10.0 by default. It is the score a word pair must exceed to be merged into a phrase. Higher values result in fewer phrases.
delimiter is an optional parameter used to join the words of a detected phrase. The default is '_' in Gensim 4.x; Gensim 3.x used the bytes value b'_'.
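To make min_count and threshold more concrete, here is a minimal pure-Python sketch of the default scoring rule as described in the Gensim documentation: score = (count(a b) - min_count) / (count(a) * count(b)) * vocabulary size. The helper name score_bigrams and the sample corpus are assumptions made for illustration. This is not Gensim's actual implementation, which also handles connector words and vocabulary pruning.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, delimiter="_"):
    """Score each adjacent word pair with the default-style formula (hypothetical helper)."""
    vocab = Counter()
    for sentence in sentences:
        # Count single words.
        vocab.update(sentence)
        # Count adjacent word pairs, stored under their joined form.
        vocab.update(delimiter.join(pair) for pair in zip(sentence, sentence[1:]))

    scores = {}
    for sentence in sentences:
        for word_a, word_b in zip(sentence, sentence[1:]):
            bigram_count = vocab[delimiter.join((word_a, word_b))]
            # Subtracting min_count discounts rare pairs; dividing by the
            # word counts penalizes pairs made of very common words.
            scores[(word_a, word_b)] = (
                (bigram_count - min_count) / vocab[word_a] / vocab[word_b] * len(vocab)
            )
    return scores


corpus = [
    ["machine", "learning", "is", "an", "exciting", "field"],
    ["artificial", "intelligence", "is", "challenging"],
    ["machine", "learning", "and", "artificial", "intelligence", "are", "related"],
]

scores = score_bigrams(corpus, min_count=1)
threshold = 1
# Keep only the pairs whose score exceeds the threshold.
phrases = sorted(pair for pair, score in scores.items() if score > threshold)
print(phrases)
```

With min_count=1, a pair that occurs only once scores 0, so only pairs seen at least twice can pass the threshold. Raising threshold above the highest score leaves no phrases at all.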
Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).
Code
Let's walk through an example to understand how gensim.models.phrases.Phrases() is used:
from gensim.models.phrases import Phrases, Phraser

# Sentences
sentences = [
    ["machine", "learning", "is", "an", "exciting", "field"],
    ["artificial", "intelligence", "is", "challenging"],
    ["machine", "learning", "and", "artificial", "intelligence", "are", "related"],
]

# Creating Phrases object
phrases = Phrases(sentences, min_count=1, threshold=1)

# Creating Phraser object for efficiency
phraser = Phraser(phrases)

# Transform sentences and print phrases
for sentence in sentences:
    transformed_sentence = phraser[sentence]
    phrase_list = [phrase.replace("_", " ") for phrase in transformed_sentence if "_" in phrase]
    print(phrase_list)
Code explanation
Line 1: First, we import the Phrases and Phraser classes from Gensim.
Lines 4–8: Next, we define the sentences list, which contains three tokenized sentences.
Line 11: Then, we create a Phrases object, phrases, by passing the sentences list as the input. The min_count parameter sets the minimum frequency of collocations, and the threshold parameter sets the score threshold for collocation detection.
Line 14: Here, we create a Phraser object, phraser, to transform the sentences more quickly.
Lines 17–19: Now, we iterate over each sentence in the sentences list using a for loop. We first apply the phraser to the current sentence using phraser[sentence] to obtain the transformed sentence, in which each identified collocation is joined into a single token. We then build phrase_list by keeping only the tokens that contain an underscore (_) and replacing that underscore with a space.
Line 20: Finally, we print phrase_list, which contains the detected phrases with spaces separating the words.
Output
Upon execution, the code identifies the multi-word phrases in each sentence and prints them, one list per sentence.
The output looks something like this:
['machine learning']
['artificial intelligence']
['machine learning', 'artificial intelligence']
Conclusion
In short, gensim.models.phrases.Phrases() lets NLP developers automatically recognize and encode meaningful multi-word patterns. Treating these phrases as single tokens can improve the quality of text analysis, sharpen topic modeling results, and give deeper insight into the underlying semantics of a corpus.