What is text classification in deep learning?

What is deep learning?

Deep learning is a subset of machine learning that uses artificial neural networks, which are loosely inspired by the structure and function of the human brain. Deep learning models stack many layers of these networks so that they can learn useful representations directly from raw data, whereas classical machine learning typically relies on hand-engineered features.

What is text classification?

Text classification is the process of labeling or organizing text data into groups, and it forms a fundamental part of Natural Language Processing (NLP). In the digital age, we are surrounded by text on social media, in commercials, on websites, in e-books, and so on. Most of this text is unstructured, so classifying it automatically can be extremely useful.

Applications

Text classification has a wide array of applications. Some popular uses are:

  • Spam detection in emails
  • Sentiment analysis of online reviews
  • Topic labeling of documents such as research papers
  • Language detection, as in Google Translate
  • Age/gender identification of anonymous users
  • Tagging online content
  • Intent detection on transcribed speech in virtual assistants like Siri and Alexa
Sentiment Analysis is an important application of Text Classification

In this shot, we'll learn about text classification using TF-IDF features and a logistic regression classifier.

The following steps perform text classification on a dataset.

Step 1

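Import the required libraries. If any of them is not installed, install it with pip, replacing library with the package name (pandas, numpy, nltk, or scikit-learn):

pip install library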
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Step 2

Load the dataset. The read_csv() method of the pandas library, imported above as pd, reads a CSV file into a DataFrame. Pass the file path as a string and keep the result in a variable:

data = pd.read_csv("data.csv")
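
As a rough illustration of what loading looks like, the sketch below reads a small in-memory CSV instead of a real file. The column names text and label and the three sample rows are assumptions made up for this example; a real data.csv may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv with two assumed columns: "text" and "label".
csv_file = io.StringIO(
    "text,label\n"
    "Great product and fast delivery,positive\n"
    "Terrible quality and it broke quickly,negative\n"
    ",positive\n"
)

data = pd.read_csv(csv_file)

print(data.shape)                 # (3, 2): three rows, two columns
print(data["text"].isna().sum())  # 1: the empty text field is read as missing
```

Checking the shape and the number of missing values right after loading makes it easier to spot the blank rows that pre-processing has to remove.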

Step 3

Pre-process the data. This means transforming the raw text into a cleaner, more consistent form that an NLP model can work with. Pre-processing involves the following steps:

  1. Remove blank rows from the data, if any.
  2. Change all the text to lowercase.
  3. Word tokenization: break a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
  4. Remove stop words: stop words are common English words that add little meaning to a sentence, such as "they", "he", and "have". They can usually be ignored without changing the meaning of the sentence.
  5. Remove non-alphabetic text.
  6. Word lemmatization: reduce the inflected forms of each word to a common base form, or lemma.
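
Steps 1 to 5 above can be sketched in plain Python as follows. This is a minimal illustration, not a definitive pipeline: the regular-expression tokenizer and the small STOP_WORDS set are stand-ins (assumptions) for NLTK's word_tokenize and stopwords corpus, and the sample sentences are made up. Lemmatization (step 6) is left out because NLTK's WordNetLemmatizer needs the WordNet corpus to be downloaded first.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK's stopwords.words("english") is far more complete.
STOP_WORDS = {"a", "an", "and", "are", "has", "have", "he", "is", "it", "the", "they", "to"}


def preprocess(text):
    """Lowercase and tokenize text, then drop stop words and non-alphabetic tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # steps 2 and 3
    return [
        tok for tok in tokens
        if tok not in STOP_WORDS  # step 4
        and tok.isalpha()         # step 5
    ]


raw_texts = ["They have 2 BIG dogs!", "   ", "The service is great."]

# Step 1: skip blank entries before processing the rest.
cleaned = [preprocess(t) for t in raw_texts if t.strip()]
print(cleaned)  # [['big', 'dogs'], ['service', 'great']]
```

With a pandas DataFrame, the same function could be applied to a text column with apply() after dropping missing rows with dropna().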

Step 4

Prepare the training and testing datasets using the train_test_split() function from sklearn.model_selection. A common choice is test_size = 0.25, which holds out 25% of the data for testing and trains on the remaining 75%.
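
A minimal sketch of the split is shown below. The eight sample texts and labels are made up, and random_state and stratify are optional extras assumed here so that the split is reproducible and keeps both classes in each part.

```python
from sklearn.model_selection import train_test_split

# Made-up sample data: four positive and four negative reviews.
texts = [
    "great phone", "love this laptop", "excellent service", "really good value",
    "awful battery", "hate the screen", "terrible support", "very bad quality",
]
labels = ["positive"] * 4 + ["negative"] * 4

Train_X, Test_X, Train_Y, Test_Y = train_test_split(
    texts,
    labels,
    test_size=0.25,    # hold out 25% of the rows for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=labels,   # keep the class balance the same in both parts
)

print(len(Train_X), len(Test_X))  # 6 2
```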

Step 5

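Encode the labels as integers so the model can work with them. LabelEncoder assigns each distinct label an integer from 0 to n_classes - 1, so a two-class dataset gets the values 0 and 1. Fit the encoder on the training labels only, then reuse the same mapping for the test labels:

Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)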
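
To see which integer each label receives, the short sketch below encodes a few made-up spam/ham labels. The label values and the _enc variable names are assumptions for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration.
Train_Y = ["spam", "ham", "spam", "ham"]
Test_Y = ["ham", "spam"]

Encoder = LabelEncoder()
Train_Y_enc = Encoder.fit_transform(Train_Y)  # learn the mapping from the training labels
Test_Y_enc = Encoder.transform(Test_Y)        # reuse the same mapping for the test labels

print(list(Encoder.classes_))  # ['ham', 'spam']: classes are sorted, so ham -> 0, spam -> 1
print(Train_Y_enc.tolist())    # [1, 0, 1, 0]
print(Test_Y_enc.tolist())     # [0, 1]
```

Calling transform() rather than fit_transform() on the test labels keeps the mapping identical for both sets; Encoder.inverse_transform() converts predicted integers back to the original label names.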

Step 6

Convert the text data to numeric vectors that the model can work with. A common choice is TF-IDF (short for term frequency–inverse document frequency), a numerical statistic that reflects how important a word is to a document in a collection: a word scores higher when it occurs often in one document but rarely in the rest of the collection.
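
One possible way to build these vectors is scikit-learn's TfidfVectorizer, which is already imported in Step 1. The sketch below is an illustration on made-up sentences; the variable name Tfidf_vect is an assumption, while Train_X_Tfidf and Test_X_Tfidf match the names used in Step 7.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences for illustration.
Train_X = [
    "the movie was great",
    "the movie was terrible",
    "great acting and great story",
]
Test_X = ["terrible story"]

Tfidf_vect = TfidfVectorizer()
Train_X_Tfidf = Tfidf_vect.fit_transform(Train_X)  # learn vocabulary and IDF weights
Test_X_Tfidf = Tfidf_vect.transform(Test_X)        # reuse them for unseen text

print(Train_X_Tfidf.shape)  # (3, 8): three documents, eight distinct words
print(Test_X_Tfidf.shape)   # (1, 8)

# Words that appear in fewer documents get a larger IDF weight.
vocab = Tfidf_vect.vocabulary_
print(Tfidf_vect.idf_[vocab["acting"]] > Tfidf_vect.idf_[vocab["the"]])  # True
```

As with the label encoder, the vectorizer is fitted on the training text only, so the test vectors share the same columns and weights.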

Step 7

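Train a classifier on the TF-IDF vectors and measure its accuracy on the test set. This example uses scikit-learn's LogisticRegression, which is a linear model rather than a deep neural network:

# Train_X_Tfidf and Test_X_Tfidf are the TF-IDF vectors from Step 6
classifier = LogisticRegression()
classifier.fit(Train_X_Tfidf, Train_Y)
predictions_deepL = classifier.predict(Test_X_Tfidf)

# accuracy_score expects the true labels first, then the predictions
print("Accuracy: ", accuracy_score(Test_Y, predictions_deepL) * 100)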
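
To try the training step without a dataset, the sketch below runs vectorization, training, and scoring on a handful of made-up reviews. The sentences, the 1/0 labels (1 for positive, 0 for negative), and the names vect and predictions are assumptions for illustration; with so little data, the printed accuracy says nothing about real-world performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up, already-encoded data: 1 = positive, 0 = negative.
Train_X = [
    "great wonderful movie",
    "loved this great film",
    "terrible boring movie",
    "hated this awful film",
]
Train_Y = [1, 1, 0, 0]
Test_X = ["great film", "awful movie"]
Test_Y = [1, 0]

# Vectorize: fit on the training text, then transform the test text.
vect = TfidfVectorizer()
Train_X_Tfidf = vect.fit_transform(Train_X)
Test_X_Tfidf = vect.transform(Test_X)

# Train the classifier and predict labels for the test text.
classifier = LogisticRegression()
classifier.fit(Train_X_Tfidf, Train_Y)
predictions = classifier.predict(Test_X_Tfidf)

accuracy = accuracy_score(Test_Y, predictions) * 100
print("Accuracy:", accuracy)
```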
Copyright ©2024 Educative, Inc. All rights reserved