Deep learning is a subset of machine learning that uses artificial neural networks to imitate the structure and function of the human brain. It relies on neural networks with many layers, which makes it more complex than traditional machine learning methods.
Text classification is the process of labeling or organizing text data into groups, and it forms a fundamental part of Natural Language Processing. In the digital age, we are surrounded by text on social media, in commercials, on websites, in e-books, and so on. The majority of this text data is unstructured, so classifying it can be extremely useful.
Text classification has a wide array of applications; popular uses include spam detection, sentiment analysis, and topic labeling.
In this shot, we'll learn how to perform text classification with scikit-learn; the example below trains a logistic regression classifier on TF-IDF features.
The following are the steps to perform text classification on a dataset.
Import the required libraries. If any of them are missing, install them with pip, for example:
pip install pandas numpy nltk scikit-learn
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
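NLTK's tokenizer, stop-word list, lemmatizer, and POS tagger rely on data packages that are downloaded separately from the library itself. If you have not used them before, a one-time download along these lines is usually needed:
import nltk
# One-time downloads for the NLTK resources used in the pre-processing step:
nltk.download('punkt')                         # tokenizer models used by word_tokenize
nltk.download('stopwords')                     # stop-word lists
nltk.download('wordnet')                       # lexical database used by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')    # model used by pos_tag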
Load the dataset. You can use the read_csv() method of the pandas library imported above. For example, if the file is named data.csv:
data = pd.read_csv("data.csv")
Perform pre-processing on the data. This means transforming the raw text into a cleaner, more uniform form for NLP. Typical steps, reflected in the imports above, are lowercasing, tokenization, stop-word removal, POS tagging, and lemmatization, as sketched below.
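The following is a minimal pre-processing sketch using the libraries imported above. It assumes the CSV loaded into data has a text column named 'text'; that column name, and the 'text_final' column it produces, are placeholders to adjust to your dataset.
# Map Penn Treebank POS tags to WordNet POS tags so the lemmatizer
# can pick the right word form (default to noun).
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase the text and split it into word tokens.
    tokens = word_tokenize(str(text).lower())
    # Keep alphabetic tokens that are not stop words, lemmatized by POS tag.
    cleaned = [lemmatizer.lemmatize(word, tag_map[tag[0]])
               for word, tag in pos_tag(tokens)
               if word.isalpha() and word not in stop_words]
    return ' '.join(cleaned)

# 'text' and 'text_final' are hypothetical column names for this sketch.
data['text_final'] = data['text'].apply(preprocess)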
Prepare the training and testing sets using the train_test_split() method from scikit-learn. A common choice is test_size = 0.25, which holds out a quarter of the data for evaluation, as in the sketch below.
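A sketch of the split, continuing with the hypothetical 'text_final' and 'label' column names from the pre-processing step:
# Hold out 25% of the data for testing; random_state makes the split reproducible.
Train_X, Test_X, Train_Y, Test_Y = train_test_split(
    data['text_final'], data['label'], test_size=0.25, random_state=42)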
Encode the labels so the model works with integers instead of strings. LabelEncoder assigns each distinct label an integer ID (0 and 1 for a binary problem):
# Fit the encoder on the training labels, then apply the same mapping to the test labels.
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)
Convert the text data into numeric vectors that the model can understand. A common choice is TF-IDF (short for term frequency–inverse document frequency), a numerical statistic that reflects how important a word is to a document in a collection; a vectorization sketch follows below.
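The training step below expects TF-IDF features named Train_X_Tfidf and Test_X_Tfidf. A sketch of how to build them with scikit-learn's TfidfVectorizer, fitting the vocabulary on the training text only (the max_features cap is an arbitrary choice):
# Learn the vocabulary and IDF weights from the training text only,
# then transform both splits into sparse TF-IDF matrices.
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Train_X)
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)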
Train the classifier on the TF-IDF features and evaluate it. This example uses scikit-learn's LogisticRegression:
# Fit the classifier on the training features and labels.
classifier = LogisticRegression()
classifier.fit(Train_X_Tfidf, Train_Y)

# Predict labels for the test set and report accuracy as a percentage.
predictions = classifier.predict(Test_X_Tfidf)
print("Accuracy: ", accuracy_score(Test_Y, predictions) * 100)