Solution Explanations: N-Grams

Explore n-gram concepts and their practical use in text preprocessing and classification. Learn how to clean text data, extract bigrams and trigrams, and use them as features with scikit-learn's CountVectorizer and MultinomialNB classifier to improve text analysis.
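
To make the idea of bigrams and trigrams concrete, here is a minimal sketch in plain Python. The helper name `ngrams` and the sample sentence are assumptions made for illustration; they are not part of the lesson's solution, which relies on CountVectorizer instead.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample sentence, split on whitespace for simplicity.
tokens = "the delivery was very fast".split()

bigrams = ngrams(tokens, 2)   # sequences of two adjacent words
trigrams = ngrams(tokens, 3)  # sequences of three adjacent words

print(bigrams)
print(trigrams)
```

A sentence with five tokens yields four bigrams and three trigrams, because each n-gram is a sliding window of n adjacent tokens.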

Solution 1: Introduction to n-grams

Here’s the solution:

Python 3.8
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import string
feedback_df = pd.read_csv('feedback.csv')

def preprocess(text):
    text = text.lower()
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text
feedback_df['feedback'] = feedback_df['feedback'].apply(preprocess)
vectorizer = CountVectorizer(tokenizer=word_tokenize, ngram_range=(2, 3))  # bigrams and trigrams
X = vectorizer.fit_transform(feedback_df['feedback'])
grams = vectorizer.get_feature_names_out()  # replaces get_feature_names(), removed in scikit-learn 1.2
print(grams)

Let’s walk through the solution:

  • Lines 7–11: We define the preprocess() function that lowercases text and removes its punctuation characters.

  • Line 12: We apply the preprocess() function to every entry in the feedback column using the apply() method and write the cleaned text back to the same column. ...