What is an n-gram representation?
An n-gram is a contiguous sequence of n words, symbols, or tokens taken from a document. In other words, n-grams are groups of adjacent items. They are widely used in natural language processing (NLP) tasks that deal with textual data.
Here, n is a positive integer such as 1, 2, 3, or 4.
Depending on the value of n, n-grams go by the following names:
- Unigram
- Bigram
- Trigram
- n-gram
Unigram
Unigrams are a type of n-gram where the value of n is 1. Unigram means taking only one word or token at a time.
Example:
text = “Educative is the best platform”
The unigrams for the above text are as follows:
[“Educative”, “is”, “the”, “best”, “platform”]
Bigram
Bigrams are a type of n-gram where the value of n is 2. Bigram means taking two words or tokens at a time.
Example:
text = “Educative is the best platform”
The bigrams for the above text are as follows:
[“Educative is”, “is the”, “the best”, “best platform”]
Trigram
Trigrams are a type of n-gram where the value of n is 3. Trigram means taking three words or tokens at a time.
Example:
text = “Educative is the best platform”
The trigrams for the above text are as follows:
[“Educative is the”, “is the best”, “the best platform”]
n-gram
n-grams can be defined for any given value of n.
Let us consider n to be 4. This means taking four words or tokens at a time.
Example:
text = “Educative is the best platform”
The 4-grams for the above text are as follows:
[“Educative is the best”, “is the best platform”]
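The four examples above follow one pattern: a window of n tokens slides across the text one token at a time, so a text with N tokens yields N - n + 1 n-grams (5, 4, 3, and 2 for the five-token sentence used here). The following is a minimal sketch of that pattern, assuming simple whitespace tokenization. The function name sliding_ngrams is hypothetical and chosen only for illustration.

```python
def sliding_ngrams(text, n):
    """Return the word n-grams of text using a sliding window (illustrative helper)."""
    tokens = text.split()  # assumption: tokens are separated by whitespace
    # Each window starts at index i and covers tokens[i] through tokens[i + n - 1].
    # If n is larger than the number of tokens, the range is empty and so is the result.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Educative is the best platform"
for n in range(1, 5):
    grams = sliding_ngrams(text, n)
    # Print n, the number of n-grams, and the n-grams themselves.
    print(n, len(grams), grams)
```

Unlike the implementation that follows, this sketch does no lowercasing or punctuation removal, so it reproduces the hand-written examples exactly.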
import re

def n_gram(text, n=1):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    gram_tokens = [token for token in text.split(" ") if token != ""]
    ngrams = zip(*[gram_tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

def unigram(text):
    print("Unigram")
    print(n_gram(text, 1))

def bigram(text):
    print("Bigram")
    print(n_gram(text, 2))

def trigram(text):
    print("Trigram")
    print(n_gram(text, 3))

if __name__ == "__main__":
    text = "Educative is the best platform"
    unigram(text)
    bigram(text)
    trigram(text)
Explanation
- Line 1: We import the re module.
- Line 3: We define the n_gram() function. It generates the n-grams for the given text and value of n.
- Line 4: The text is converted to lowercase.
- Line 5: Every character that is neither alphanumeric nor whitespace is replaced with a space.
- Line 6: The tokens are generated by splitting the text on the space character and discarding empty strings.
- Lines 7–8: The token list is zipped with copies of itself shifted by 1 to n-1 positions, so each resulting tuple holds n adjacent tokens. The tuples are joined with spaces and returned as a list.
- Lines 10–12: We define the unigram() function. It prints the unigram representation of the text by invoking n_gram() with n=1.
- Lines 14–16: We define the bigram() function. It prints the bigram representation of the text by invoking n_gram() with n=2.
- Lines 18–20: We define the trigram() function. It prints the trigram representation of the text by invoking n_gram() with n=3.
- Line 23: We define the text.
- Line 24: We invoke the unigram() function.
- Line 25: We invoke the bigram() function.
- Line 26: We invoke the trigram() function.
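Lines 4–8 do most of the work, so it may help to look at the intermediate values. The sketch below repeats those steps for n = 2 outside the function. The sample sentence (with an added exclamation mark) and the variable names cleaned, shifted, and pairs are assumptions introduced only for illustration.

```python
import re

text = "Educative is the best platform!"  # sample input (assumption), with punctuation
n = 2

# Steps from lines 4-6: lowercase, replace non-alphanumerics with spaces, split into tokens.
cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', text.lower())
tokens = [token for token in cleaned.split(" ") if token != ""]
print(tokens)      # ['educative', 'is', 'the', 'best', 'platform']

# Step from line 7: build n copies of the token list, the i-th copy shifted left by i.
shifted = [tokens[i:] for i in range(n)]
print(shifted[0])  # ['educative', 'is', 'the', 'best', 'platform']
print(shifted[1])  # ['is', 'the', 'best', 'platform']

# zip groups the items at the same position and stops at the shortest list,
# so each tuple holds n adjacent tokens.
pairs = list(zip(*shifted))
print(pairs[0])    # ('educative', 'is')

# Step from line 8: join each tuple into a single string.
bigrams = [" ".join(pair) for pair in pairs]
print(bigrams)     # ['educative is', 'is the', 'the best', 'best platform']
```

Because of the lowercasing step, the output is all lowercase, unlike the hand-written examples earlier in this answer.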