How to use SVMs for text classification

What are Support Vector Machines?

Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. For classification, SVM finds a hyperplane that separates the two classes of data with the largest possible margin, and that boundary is then used to classify new points.
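
As a quick illustration, here is a minimal sketch that fits a linear-kernel SVM from scikit-learn on a tiny two-class dataset and uses the learned hyperplane to classify a new point. The data points and variable names are made up for the example:

from sklearn import svm

# toy 2-D, two-class data (illustrative values only)
X = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],   # class 0
     [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]]   # class 1
y = [0, 0, 0, 1, 1, 1]

# a linear-kernel SVM learns a separating hyperplane w.x + b = 0
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # hyperplane parameters w and b
print(clf.predict([[2.5, 2.5]]))   # the new point falls on the class-0 side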

What is text classification?

Text Classification is the process of labeling or organizing text data into groups – it forms a fundamental part of Natural Language Processing.

In the digital age, we are surrounded by text: on social media, in commercials, on websites, in e-books, and so on. The majority of this text data is unstructured, so classifying it can be extremely useful.

Applications

Text Classification has a wide array of applications. Some popular uses are:

  • Spam detection in emails
  • Sentiment analysis of online reviews
  • Topic labeling of documents such as research papers
  • Language detection like in Google Translate
  • Age/gender identification of anonymous users
  • Tagging online content
  • Speech recognition used in virtual assistants (like Siri and Alexa)
Sentiment Analysis is an important application of Text Classification

In this shot, we'll learn about text classification using support vector machines (SVMs).

Below is a series of steps that will allow you to perform text classification on any dataset.

Step 1

Import the required libraries. If any of them are missing, install them first:

pip install pandas numpy nltk scikit-learn
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, svm
from sklearn.metrics import accuracy_score
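
The NLTK steps below also rely on a few data packages (tokenizer models, the stop word list, and the WordNet database) that are not bundled with the library itself. If they are missing, download them once:

import nltk
nltk.download('punkt')                        # tokenizer models used by word_tokenize
nltk.download('stopwords')                    # English stop word list
nltk.download('wordnet')                      # lexical database used by the lemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger used by pos_tag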

Step 2

Load the dataset with the read_csv() method of the pandas library. Replace "data.csv" with the path to your own CSV file.

data = pd.read_csv("data.csv")

Step 3

Pre-process the data. This means transforming the raw text into a form that is easier for an NLP model to work with. Pre-processing typically involves the following steps (a code sketch follows the list):

  1. Remove any blank rows in the data
  2. Change all the text to lower case
  3. Word tokenization: break the stream of text into words, phrases, symbols, or other meaningful elements called tokens
  4. Remove stop words: stop words are common English words (e.g., the, he, have) that add little meaning to a sentence and can be safely ignored
  5. Remove non-alphabetic tokens
  6. Word lemmatization: reduce each word's inflectional forms to a common base or root
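
Here is a minimal sketch of these steps, using the libraries imported in Step 1. It assumes the DataFrame loaded in Step 2 is called data and that the raw text lives in a column named 'text'; adjust both names to match your dataset.

# 1. remove blank rows and 2. lower-case the text
data.dropna(subset=['text'], inplace=True)
data['text'] = data['text'].str.lower()

# map Penn Treebank POS tags to WordNet tags for the lemmatizer
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text)                        # 3. tokenization
    final_words = []
    for word, tag in pos_tag(tokens):
        if word not in stop_words and word.isalpha():   # 4. drop stop words, 5. drop non-alpha
            final_words.append(lemmatizer.lemmatize(word, tag_map[tag[0]]))  # 6. lemmatization
    return ' '.join(final_words)

data['text_final'] = data['text'].apply(preprocess)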

Step 4

Split the data into training and testing sets with the train_test_split() method of the sklearn library. A common choice is to reserve 25% of the data for testing (test_size = 0.25).
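
A sketch of this step, assuming the data DataFrame with the text_final column from Step 3 and a label column named 'label' (adjust the names to your dataset):

Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    data['text_final'], data['label'],
    test_size=0.25, random_state=42)   # random_state fixes the split for reproducibility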

Step 5

Perform label encoding on the target so the model can work with numeric classes. For a binary problem, LabelEncoder assigns 0 or 1 to the two labels (for more classes, it assigns 0, 1, 2, ...). Fit the encoder on the training labels and reuse it to transform the test labels:

Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)

Step 6

Convert the text data into vectors that the model can understand. One common choice is TF-IDF (short for term frequency–inverse document frequency), a numerical statistic that reflects how important a word is to a document within a collection.
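
A sketch of this step with the TfidfVectorizer imported in Step 1 (the max_features limit of 5000 is an arbitrary choice; it caps the vocabulary at the 5,000 most frequent terms):

Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(data['text_final'])              # learn the vocabulary and IDF weights from the corpus

Train_X_Tfidf = Tfidf_vect.transform(Train_X)   # vectorize the training texts
Test_X_Tfidf = Tfidf_vect.transform(Test_X)     # vectorize the test texts with the same vocabulary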

Step 7

Train the SVM classifier and evaluate it on the test set.

# train the SVM classifier on the TF-IDF vectors
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf, Train_Y)

# predict labels for the test set
predictions_SVM = SVM.predict(Test_X_Tfidf)

# report the accuracy
print("Accuracy: ", accuracy_score(predictions_SVM, Test_Y) * 100)