What is the gensim.models.phrases.Phrases() function?

Gensim is a Python library for natural language processing (NLP) that offers a range of powerful tools to extract insights from text data.

The `gensim.models.phrases.Phrases()` function

gensim.models.phrases.Phrases is a class provided by Gensim that detects frequently co-occurring word combinations (phrases) in a corpus. Calling Phrases() trains a phrase-detection model on the sentences passed to it.

The Phrases model automatically recognizes word sequences with considerable semantic value, such as collocations and other multi-word expressions, and encodes each one as a single token so that this information can be extracted from text corpora.

Syntax

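Below is the syntax of the gensim.models.phrases.Phrases() constructor, limited to the parameters discussed here:

Phrases(sentences=None, min_count=5, threshold=10.0, delimiter='_')

sentences is an iterable of tokenized sentences, where each sentence is a list of words. It defaults to None, but it must be supplied for the model to learn any phrases when it is created.
min_count is an optional parameter set to 5 by default. It is the minimum number of times a word or bigram must occur in the training data to be considered.
threshold is an optional parameter set to 10.0 by default, representing the score a candidate must exceed to be accepted as a phrase. Higher values result in fewer phrases.
delimiter is an optional parameter used to join the words of a detected phrase. Its default is '_' in Gensim 4.x and b'_' in Gensim 3.x.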
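
To see how min_count and threshold interact, the sketch below reimplements the default scoring rule in plain Python. It assumes Gensim's default scorer, which is based on the formula from Mikolov et al.: (bigram count - min_count) / (count of first word * count of second word) * vocabulary size. The function name bigram_score and the counts passed to it are hypothetical and chosen only for illustration; this is not Gensim's own code.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; a higher score means a stronger collocation.

    Illustrative re-implementation of the default scoring formula
    (an assumption about the scorer in use, not Gensim's source code).
    """
    denom = count_a * count_b
    if denom == 0:
        # A word that was never seen cannot form a phrase
        return float("-inf")
    return (count_ab - min_count) / denom * vocab_size


# Hypothetical counts: both words occur twice, always next to each other,
# in a vocabulary of 24 entries (single words plus bigrams)
score = bigram_score(count_a=2, count_b=2, count_ab=2, vocab_size=24, min_count=1)

# The bigram becomes a phrase only if its score exceeds the threshold
for threshold in (1.0, 10.0):
    print(f"threshold={threshold}: accepted={score > threshold}")
```

A bigram is kept as a phrase only when its score exceeds threshold, which is why raising threshold yields fewer phrases. Raising min_count lowers every score and has a similar effect.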

Note: Make sure you have the Gensim library installed (you can install it using pip install gensim).

Code

Let's walk through an example to understand how gensim.models.phrases.Phrases() is used:

from gensim.models.phrases import Phrases, Phraser

# Sentences
sentences = [
    ["machine", "learning", "is", "an", "exciting", "field"],
    ["artificial", "intelligence", "is", "challenging"],
    ["machine", "learning", "and", "artificial", "intelligence", "are", "related"],
]

# Creating Phrases object
phrases = Phrases(sentences, min_count=1, threshold=1)

# Creating Phraser object for efficiency
phraser = Phraser(phrases)

# Transform sentences and print phrases
for sentence in sentences:
    transformed_sentence = phraser[sentence]
    phrase_list = [phrase.replace("_", " ") for phrase in transformed_sentence if "_" in phrase]
    print(phrase_list)

Code explanation

Line 1: First, we import the Phrases and Phraser classes from Gensim.
Lines 4–8: Next, we define the sentences variable, a list of three tokenized sentences.
Line 11: Then, we create a Phrases object, phrases, by passing the sentences list as the input. The min_count parameter sets the minimum frequency of collocations, and the threshold parameter sets the score threshold for collocation detection. Both are set to 1 here so that phrases can be detected in such a small corpus.
Line 14: Here, we create a Phraser object, phraser, a lighter, read-only version of the trained model that transforms sentences more quickly. In Gensim 4.x, Phraser is an alias of FrozenPhrases.
Lines 17–19: Now, we iterate over each sentence in the sentences list using a for loop. We first apply the phraser to the current sentence using phraser[sentence] to obtain the transformed sentence, in which each identified collocation is merged into a single token. We then build phrase_list by keeping only the tokens that contain an underscore (_) and replacing that underscore with a space.
Line 20: Finally, we print the phrase_list, which contains the detected phrases with spaces separating the words.

Output

Upon execution, the code identifies the collocations (combinations of words) in the sentences and prints only the transformed collocations.

The output looks something like this:

['machine learning']
['artificial intelligence']
['machine learning', 'artificial intelligence']