What is an n-gram representation?

An n-gram is a contiguous sequence of n words, symbols, or tokens taken from a document. In other words, n-grams are groups of adjacent items in the text. They are widely used in natural language processing (NLP) tasks that deal with textual data.
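Because the items can be symbols as well as words, the same idea applies at the character level. The snippet below is a minimal sketch of that case; the helper name char_ngrams is hypothetical and is not used elsewhere in this answer.

```python
def char_ngrams(text, n):
    """Return every run of n adjacent characters in text (hypothetical helper)."""
    # A window of width n can start at indices 0 .. len(text) - n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("gram", 2))  # ['gr', 'ra', 'am']
```

If the text is shorter than n, the range is empty and the function returns an empty list.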

Here, n is a positive integer (1, 2, 3, 4, and so on) that specifies how many adjacent items make up each group.
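The value of n also determines how many n-grams a text produces: a sequence of m tokens yields m - n + 1 n-grams, or none when n is larger than m. The sketch below illustrates this relationship; the function name ngram_count is an assumption made for this example.

```python
def ngram_count(num_tokens, n):
    """Number of n-grams in a sequence of num_tokens items (illustrative helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Each n-gram starts at a distinct position; the last one starts at num_tokens - n.
    return max(num_tokens - n + 1, 0)


# A five-token sentence has 5 unigrams, 4 bigrams, and 3 trigrams.
for n in (1, 2, 3):
    print(n, ngram_count(5, n))
```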

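Depending on the value of n, n-grams go by the following names:

Unigram: n = 1, each item on its own
Bigram: n = 2, pairs of adjacent items
Trigram: n = 3, triples of adjacent items
n-gram: the general case for any other value of n (for example, a 4-gram or a 5-gram)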

import re

def n_gram(text, n=1):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    gram_tokens = [token for token in text.split(" ") if token != ""]
    ngrams = zip(*[gram_tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

def unigram(text):
    print("Unigram")
    print(n_gram(text, 1))

def bigram(text):
    print("Bigram")
    print(n_gram(text, 2))

def trigram(text):
    print("Trigram")
    print(n_gram(text, 3))

if __name__ == "__main__":
    text = "Educative is the best platform"
    unigram(text)
    bigram(text)
    trigram(text)

Explanation

Line 1: We import the re module, which provides regular expression support.
Line 3: We define the n_gram() function. It generates the n-grams for the given text and value of n.
Line 4: The text is converted to lowercase.
Line 5: Every non-alphanumeric, non-whitespace character in the text is replaced with a space.
Line 6: The tokens are generated by splitting the text on the space character and discarding any empty strings.
Lines 7–8: The n-grams are generated by zipping together n copies of the token list, each shifted by one more position, and are returned as a list of space-joined strings.
Lines 10–12: We define the unigram() function. It prints the unigram representation of the text by invoking n_gram() with n=1.
Lines 14–16: We define the bigram() function. It prints the bigram representation of the text by invoking n_gram() with n=2.
Lines 18–20: We define the trigram() function. It prints the trigram representation of the text by invoking n_gram() with n=3.
Line 23: We define the sample text.
Line 24: We invoke the unigram() function.
Line 25: We invoke the bigram() function.
Line 26: We invoke the trigram() function.
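
The zip() step on lines 7–8 is the least obvious part of the code, so it may help to see it unpacked. The following sketch walks through the bigram case using the already cleaned tokens of the sample sentence; the variable names shifted, pairs, and bigrams are chosen for this illustration only.

```python
tokens = ["educative", "is", "the", "best", "platform"]
n = 2

# Build n copies of the token list, the i-th copy starting at index i.
shifted = [tokens[i:] for i in range(n)]
# shifted[0] -> ['educative', 'is', 'the', 'best', 'platform']
# shifted[1] -> ['is', 'the', 'best', 'platform']

# zip() pairs up items at the same position and stops at the shortest list,
# so every tuple holds n adjacent tokens.
pairs = list(zip(*shifted))
bigrams = [" ".join(pair) for pair in pairs]

print(bigrams)
# ['educative is', 'is the', 'the best', 'best platform']
```

Because zip() stops at the shortest input, no partial n-gram is produced at the end of the text, and an n larger than the number of tokens gives an empty list.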