What is text classification in deep learning?

What is deep learning?

Deep learning is a subset of machine learning that uses artificial neural networks, which are loosely inspired by the structure and function of the human brain. What sets deep learning apart from other machine learning methods is that its networks stack many layers, which lets them learn useful features directly from raw data instead of relying on hand-crafted ones.

What is text classification?

Text classification is the process of labeling or organizing text data into groups, and it forms a fundamental part of Natural Language Processing (NLP). In the digital age, we are surrounded by text on social media, in commercials, on websites, in e-books, and so on. Most of this text is unstructured, so classifying it automatically can be extremely useful.

Applications

Text classification has a wide array of applications. Some popular uses are:

Spam detection in emails
Sentiment analysis of online reviews
Topic labeling of documents such as research papers
Language detection, as in Google Translate
Age/gender identification of anonymous users
Tagging online content
Intent classification of transcribed speech in virtual assistants like Siri and Alexa

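In this shot, we'll learn about text classification using TF-IDF features and a scikit-learn classifier. The code below uses logistic regression; a Naive Bayes (NB) classifier can be trained on the same features in the same way.

The following steps perform text classification on a dataset.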

Step 1

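Import the required libraries. If any of them is not installed, install it with pip:

pip install pandas numpy nltk scikit-learn

The NLTK tokenizer, stop-word list, and WordNet lemmatizer also need their data files, which can be fetched once with nltk.download().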

# Data handling
import pandas as pd
import numpy as np
from collections import defaultdict

# Text pre-processing (NLTK)
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

# Splitting, encoding, vectorizing, training, and scoring (scikit-learn)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Step 2

Load the dataset. The read_csv() function of the pandas library, imported above, reads a CSV file into a DataFrame. The file name is passed as a string (data.csv stands for your own file here):

Corpus = pd.read_csv("data.csv")
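As a rough illustration of what loading looks like, the sketch below reads a tiny CSV from an in-memory string instead of a file. The column names text and label and the sample rows are assumptions made up for this example, not part of any particular dataset.

```python
import io
import pandas as pd

# Hypothetical CSV contents; a real project would pass a file path instead.
csv_text = """text,label
"Win a free prize now",spam
"Are we still meeting at noon?",ham
"Claim your free reward today",spam
"""

# read_csv accepts a path or any file-like object.
corpus = pd.read_csv(io.StringIO(csv_text))

print(corpus.shape)                    # (rows, columns)
print(corpus["label"].value_counts())  # how many examples per class
```

Checking the shape and the class counts right after loading is a cheap way to catch a wrong delimiter or a heavily imbalanced dataset early.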

Step 3

Pre-process the data. This means cleaning and normalizing the raw text so that it can be turned into features for an NLP model. Pre-processing typically includes:

Remove blank rows from the data, if any
Change all the text to lower case
Word tokenization: breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
Remove stop words: stop words are common words that add little meaning to a sentence, such as they, he, and have. For many classification tasks they can be dropped without losing much information.
Remove non-alphabetic text
Word lemmatization: reducing the inflected forms of each word to a common base form (its lemma).
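The first five items can be sketched as follows. To keep the example runnable without downloading NLTK data, it swaps in a regular expression for word_tokenize and scikit-learn's built-in ENGLISH_STOP_WORDS list for NLTK's stopwords; lemmatization with WordNetLemmatizer is left out for the same reason. The preprocess helper, the column names, and the sample rows are hypothetical.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text):
    """Lowercase, tokenize, and drop stop words and non-alphabetic tokens."""
    # Lowercase first, then split into runs of letters and digits (crude tokenizer).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Keep purely alphabetic tokens that are not stop words.
    return [t for t in tokens if t.isalpha() and t not in ENGLISH_STOP_WORDS]

# Hypothetical data with one blank row.
corpus = pd.DataFrame({
    "text": ["They have WON 2 free tickets!", None, "See you at the meeting"],
    "label": ["spam", "ham", "ham"],
})

# Remove rows with missing text, then pre-process each remaining row.
corpus = corpus.dropna(subset=["text"]).reset_index(drop=True)
corpus["tokens"] = corpus["text"].apply(preprocess)

print(corpus[["tokens", "label"]])
```

With NLTK data installed, word_tokenize, stopwords.words("english"), and WordNetLemmatizer().lemmatize() could replace the stand-ins used here.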

Step 4

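Split the data into training and testing sets using the train_test_split() function from sklearn.model_selection. A common choice is test_size = 0.25, which holds out 25% of the data for testing and trains on the remaining 75%. The split ratio does not improve accuracy by itself; it trades off how much data the model learns from against how reliable the test score is.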
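A minimal sketch of the split, using a made-up list of eight texts. The random_state and stratify arguments are optional additions for this example: the first makes the split reproducible, and the second keeps the class proportions the same in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: four spam and four ham messages.
texts = ["free prize now", "meeting at noon", "claim your reward", "lunch tomorrow",
         "win cash today", "project update", "cheap loans online", "see you soon"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

# Hold out 25% of the examples for testing.
Train_X, Test_X, Train_Y, Test_Y = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(Train_X), len(Test_X))
```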

Step 5

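Encode the labels as integers so that the model can work with them. LabelEncoder maps each distinct label to an integer from 0 to n_classes - 1, so with two classes the labels become 0 and 1. Fit the encoder on the training labels only and reuse the same mapping for the test labels:

Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)  # learn the label-to-integer mapping
Test_Y = Encoder.transform(Test_Y)        # apply the same mapping, without refitting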
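To see what the encoder produces, here is a small sketch with made-up spam/ham labels. It fits on the training labels and then calls transform() on the test labels, so both sets share one mapping; the _enc variable names are only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels.
Train_Y = ["spam", "ham", "ham", "spam"]
Test_Y = ["ham", "spam"]

Encoder = LabelEncoder()
Train_Y_enc = Encoder.fit_transform(Train_Y)  # learn the mapping from the training labels
Test_Y_enc = Encoder.transform(Test_Y)        # reuse that mapping for the test labels

print(list(Encoder.classes_))   # classes in sorted order; the position is the integer code
print(Train_Y_enc, Test_Y_enc)
```

Encoder.inverse_transform() converts predicted integers back into the original label strings when reporting results.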

Step 6

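Convert the text data to numeric vectors that the model can work with. One option is TF-IDF (short for term frequency–inverse document frequency), a numerical statistic that reflects how important a word is to a document in a collection: a word scores high when it is frequent in one document but rare across the others. Fit the vectorizer on the training text only, then use it to transform both sets. The next step assumes the results are stored as Train_X_Tfidf and Test_X_Tfidf.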
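A minimal sketch of this step with scikit-learn's TfidfVectorizer, using made-up sentences and default settings. The variable names Tfidf_vect, Train_X_Tfidf, and Test_X_Tfidf are chosen to match the training code in the next step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test texts.
Train_X = ["free prize now", "meeting at noon", "claim your free reward"]
Test_X = ["free meeting"]

Tfidf_vect = TfidfVectorizer()
Train_X_Tfidf = Tfidf_vect.fit_transform(Train_X)  # learn vocabulary and IDF weights
Test_X_Tfidf = Tfidf_vect.transform(Test_X)        # reuse them for unseen text

print(Train_X_Tfidf.shape, Test_X_Tfidf.shape)  # (documents, vocabulary size)
print(sorted(Tfidf_vect.vocabulary_))           # the learned vocabulary
```

Both matrices have one column per vocabulary word, which is what lets a model trained on one be applied to the other. Words in the test text that never appeared in training are ignored.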

Step 7

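Train a classifier on the TF-IDF vectors and measure its accuracy on the test set. The classifier used here, logistic regression, is a classical machine learning model rather than a deep neural network:

classifier = LogisticRegression()
classifier.fit(Train_X_Tfidf, Train_Y)                # learn from the training vectors
predictions_LR = classifier.predict(Test_X_Tfidf)     # predict labels for the test vectors

# accuracy_score expects the true labels first, then the predictions
print("Accuracy: ", accuracy_score(Test_Y, predictions_LR) * 100)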
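Since the introduction mentions Naive Bayes, here is a sketch of the same train-and-evaluate step with scikit-learn's MultinomialNB swapped in for logistic regression. The toy sentences, the integer labels, and the predictions_NB name are assumptions for this example, and a dataset this small says nothing about real-world accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical, already-encoded toy data: 1 = spam, 0 = ham.
Train_X = ["win a free prize now", "claim your free reward", "cheap loans win cash",
           "meeting at noon tomorrow", "project update attached", "lunch meeting tomorrow"]
Train_Y = [1, 1, 1, 0, 0, 0]
Test_X = ["free cash prize", "project meeting at noon"]
Test_Y = [1, 0]

# Vectorize: fit on the training text, then transform both sets.
Tfidf_vect = TfidfVectorizer()
Train_X_Tfidf = Tfidf_vect.fit_transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

# Train Naive Bayes and predict labels for the test vectors.
classifier = MultinomialNB()
classifier.fit(Train_X_Tfidf, Train_Y)
predictions_NB = classifier.predict(Test_X_Tfidf)

print("Accuracy: ", accuracy_score(Test_Y, predictions_NB) * 100)
```

Because both classifiers expose the same fit() and predict() methods, comparing them on the same TF-IDF features only requires changing the line that constructs the classifier.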
