Trusted answers to developer questions

How to do text classification with the Naive Bayes algorithm

Sarvech Qadir

What are Naive Bayes classifiers?

Naive Bayes classifiers are based on Bayes’ theorem. The “naive” part is the assumption that the presence or absence of one feature does not influence the presence or absence of any other feature, given the class. In statistics, Naive Bayes classifiers are simple “probabilistic classifiers” that apply Bayes’ theorem with strong independence assumptions between the features.
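
The idea can be written compactly. The formulation below is the standard textbook one, not taken from the article; the symbols (a class c and features x_1, ..., x_n) are introduced here only for illustration. The first expression is Bayes’ theorem, and the second follows once the features are assumed independent given the class:

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
\quad\Longrightarrow\quad
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The classifier then predicts the class c for which the right-hand side is largest.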

What is Text Classification?

Text classification is the process of labeling or organizing text data into groups, and it forms a fundamental part of Natural Language Processing (NLP). In today’s digital age, we are surrounded by text on social media accounts, in commercials, on websites, in ebooks, etc. The majority of this text data is unstructured, so classifying it can be extremely useful.


Text Classification has a wide array of applications. Some popular uses are:

  • Spam detection in emails
  • Sentiment analysis of online reviews
  • Topic labeling of documents (e.g., research papers)
  • Language detection, as in Google Translate
  • Age/gender identification of anonymous users
  • Tagging online content
  • Speech recognition used in virtual assistants like Siri or Alexa
Figure: Sentiment analysis is an important application of text classification

The following steps can be used to perform text classification on a dataset:

Step 1

Import the required libraries. If any of them is not installed, install it first with pip, for example:

pip install pandas numpy nltk scikit-learn

import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes
from sklearn.metrics import accuracy_score

Step 2

Load the relevant dataset. The read_csv() function of the pandas library reads a CSV file into a DataFrame.
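
A minimal sketch of loading a dataset with read_csv() might look like the following. The article does not say which file or columns it uses, so the in-memory CSV, the column names text and label, and the variable name Corpus are all assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the columns "text" and "label"
# are hypothetical and not taken from the article.
csv_data = io.StringIO(
    "text,label\n"
    "Great product and fast delivery,positive\n"
    "Terrible quality and broke in a day,negative\n"
    "Absolutely love it,positive\n"
)

# With a real file, pass its path instead of the StringIO object.
Corpus = pd.read_csv(csv_data)

print(Corpus.shape)
print(Corpus.head())
```

With a real dataset, an encoding argument (for example encoding="utf-8") may also be needed, depending on how the file was saved.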


Step 3

Pre-process the data, i.e., transform the raw text into a cleaner form that is easier for NLP methods to work with. Pre-processing typically involves the following steps:

  1. Remove blank rows in the data, if any.
  2. Change all the text to lower case.
  3. Word tokenization: the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
  4. Remove stop words: English words that don’t add much meaning to a sentence. They can be ignored without sacrificing the meaning of the sentence. For example, words like the, he, have, etc.
  5. Remove non-alpha text: all characters that are not letters or numbers (e.g., “!”).
  6. Word lemmatization: reducing the inflectional forms of each word to a common base or root. For example, the verb ‘to walk’ may appear as ‘walk’, ‘walked’, ‘walks’ or ‘walking’. The base form, ‘walk’, which one might look up in a dictionary, is called the word’s lemma.
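
The first five steps can be sketched in plain Python as shown below. This is a simplified stand-in, not the article's code: it uses a regular expression instead of NLTK's word_tokenize and a tiny hand-written stop-word set instead of NLTK's stopwords corpus, so that it runs without downloading any NLTK data. The function name preprocess and the sample sentences are made up for illustration. Lemmatization (step 6) is left out because WordNetLemmatizer needs the WordNet data, which has to be fetched with nltk.download() first.

```python
import re

# Tiny illustrative stop-word set; NLTK's English stop-word list
# is much more complete.
STOP_WORDS = {"the", "a", "an", "is", "he", "she", "have", "and", "to", "of"}


def preprocess(texts):
    """Apply steps 1-5 to a list of raw strings and return token lists."""
    cleaned = []
    for text in texts:
        # Step 1: skip blank or missing rows
        if text is None or not text.strip():
            continue
        # Step 2: change the text to lower case
        text = text.lower()
        # Steps 3 and 5: split into tokens, keeping only letters and digits
        tokens = re.findall(r"[a-z0-9]+", text)
        # Step 4: drop stop words
        tokens = [tok for tok in tokens if tok not in STOP_WORDS]
        cleaned.append(tokens)
    return cleaned


raw_texts = ["The movie is GREAT!!!", "   ", "He did not like the ending..."]
result = preprocess(raw_texts)
print(result)
```

The blank second row is dropped, and punctuation and stop words are removed from the other two.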

Step 4

Prepare the training and testing datasets using the train_test_split() function from sklearn’s model_selection module. A common choice is test_size = 0.25, which holds out 25% of the data for testing and trains on the remaining 75%.
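
As a rough sketch, the split could be done as follows. The toy texts and labels are invented for illustration, and random_state is an optional extra (not mentioned in the article) that makes the split reproducible.

```python
from sklearn import model_selection

# Toy data, made up for illustration
texts = [
    "win a free prize", "meeting at noon", "free money offer", "lunch tomorrow",
    "claim your prize", "project notes", "cheap offer now", "team meeting today",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Hold out 25% of the samples for testing
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

print(len(Train_X), len(Test_X))
```

With eight samples, six end up in the training set and two in the test set.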

Step 5

Encode the labels so that the model can tell the classes apart: LabelEncoder maps each distinct label to an integer (0 or 1 when there are two classes).

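Encoder = LabelEncoder()
# Learn the label-to-integer mapping from the training labels
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the same mapping for the test labels instead of refitting
Test_Y = Encoder.transform(Test_Y)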
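
To see what the encoder does, here is a small sketch with made-up labels. It fits the mapping on the training labels only and reuses it for the test labels with transform(), so both sets share the same integer codes. The names Train_Y_enc and Test_Y_enc are introduced here for clarity.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
Train_Y = ["spam", "ham", "spam", "ham"]
Test_Y = ["ham", "spam"]

Encoder = LabelEncoder()
Train_Y_enc = Encoder.fit_transform(Train_Y)  # learns the mapping
Test_Y_enc = Encoder.transform(Test_Y)        # applies the learned mapping

# Classes are sorted alphabetically, so "ham" becomes 0 and "spam" becomes 1
print(list(Encoder.classes_))
print(Train_Y_enc.tolist())
print(Test_Y_enc.tolist())
```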

Step 6

Convert the text data to numeric vectors that the model can understand. One option is TF-IDF, short for term frequency–inverse document frequency, which reflects how important a word is to a document in a collection.
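
One way to do this with scikit-learn is sketched below, using invented toy sentences. The vectorizer is fitted on the training text only and then applied to the test text, which is an assumption about the workflow and not something the article spells out. The name Tfidf_vect is also an assumption; Train_X_Tfidf and Test_X_Tfidf follow the naming used in Step 7.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, made up for illustration
Train_X = ["free prize now", "meeting at noon", "free money offer"]
Test_X = ["free meeting"]

Tfidf_vect = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training text
Train_X_Tfidf = Tfidf_vect.fit_transform(Train_X)
# Map the test text onto the same vocabulary
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

# Rows are documents, columns are vocabulary terms
print(Train_X_Tfidf.shape, Test_X_Tfidf.shape)
```

Both matrices have one column per vocabulary term, so the model sees training and test documents in the same feature space.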

Step 7

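Train the Naive Bayes classifier and evaluate it on the test set:

# Fit the classifier on the TF-IDF training vectors
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf, Train_Y)
# Predict labels for the test set and report accuracy as a percentage
predictions_NB = Naive.predict(Test_X_Tfidf)
print("Accuracy: ", accuracy_score(Test_Y, predictions_NB) * 100)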
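
Putting Steps 5 to 7 together, a self-contained sketch on a tiny invented dataset could look like this. The sentences and labels are made up, so the resulting accuracy says nothing about real-world performance; the point is only to show how the pieces connect.

```python
from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration
Train_X = [
    "win free prize now", "free money offer", "claim your free prize",
    "meeting at noon tomorrow", "lunch with the team", "project meeting notes",
]
Train_Y = ["spam", "spam", "spam", "ham", "ham", "ham"]
Test_X = ["free prize offer", "team meeting tomorrow"]
Test_Y = ["spam", "ham"]

# Step 5: encode labels ("ham" -> 0, "spam" -> 1)
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)

# Step 6: convert text to TF-IDF vectors
Tfidf_vect = TfidfVectorizer()
Train_X_Tfidf = Tfidf_vect.fit_transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

# Step 7: train the classifier, predict, and score
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf, Train_Y)
predictions_NB = Naive.predict(Test_X_Tfidf)
accuracy = accuracy_score(Test_Y, predictions_NB) * 100

print("Predictions:", Encoder.inverse_transform(predictions_NB).tolist())
print("Accuracy: ", accuracy)
```

Encoder.inverse_transform() turns the predicted integers back into the original label strings, which is handy when reporting results.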




Copyright ©2022 Educative, Inc. All rights reserved
