Sentiment analysis is a natural language processing (NLP) technique that determines the sentiment expressed in a text. This Answer explores how to perform sentiment analysis on NLTK's Movie Review dataset. We will use the Natural Language Toolkit (NLTK) and scikit-learn to build a sentiment classifier based on support vector machines (SVMs).
Support vector machines (SVMs) are a popular class of supervised learning algorithms for classification and regression tasks. In sentiment analysis, SVMs are often chosen for their effectiveness in handling high-dimensional data, such as text data represented as TF-IDF vectors.
Here’s a brief explanation of why SVMs are suitable for sentiment analysis:
Effective in high-dimensional spaces: In sentiment analysis, each document (e.g., movie review) is represented as a high-dimensional vector, where each dimension corresponds to a unique word or term. SVMs are effective in high-dimensional spaces, making them well-suited for tasks like text classification.
Effective for linear and nonlinear separation: SVMs aim to find the hyperplane that best separates the data points belonging to different classes (positive and negative sentiments, in this case). Using appropriate kernel functions, SVMs can handle both linearly and nonlinearly separable data (see the short sketch after this list).
Robustness to overfitting: SVMs have regularization parameters that help control overfitting, making them robust to noisy or complex datasets.
Global optimal solution: SVM training is a convex optimization problem, so the maximum-margin hyperplane it finds is a global optimum rather than a local one, which leads to better generalization and robustness of the model.
Fewer training examples are required: SVMs are effective even when the number of training examples is smaller than the number of features, which is common in text data where the vocabulary size can be large.
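To make these points concrete, here is a minimal, self-contained sketch (separate from the movie-review pipeline built below) that cross-validates an SVM with a linear and an RBF kernel on a small synthetic high-dimensional dataset. The dataset and parameter values are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# High-dimensional data with relatively few samples, similar in spirit to TF-IDF features
X, y = make_classification(n_samples=200, n_features=1000, n_informative=20, random_state=0)

# A linear kernel looks for a separating hyperplane directly; an RBF kernel can model
# nonlinear boundaries. C is the regularization parameter that helps control overfitting.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, "mean accuracy:", scores.mean())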
We aim to build a sentiment analysis model to classify movie reviews into positive or negative sentiments. We will use a dataset of movie reviews labeled as positive or negative to train and evaluate our sentiment classifier.
Let’s start by importing the required libraries for our sentiment analysis project. We’ll use NLTK for natural language processing tasks and scikit-learn for machine learning tasks.
import nltk
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import random
Next, we need to load the Movie Review dataset provided by NLTK. This dataset contains reviews labeled as positive or negative.
nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
The code above is explained in detail below:
Line 1: Download the 'movie_reviews' dataset using NLTK.
Lines 2–4: Create a list of tuples, each containing a movie review’s word list and its category (positive or negative).
Line 5: Shuffle the list of tuples to randomize the order of reviews for later splitting into training and testing sets.
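As a quick, optional sanity check, you can inspect what was loaded. The standard movie_reviews corpus contains 2,000 reviews labeled either pos or neg:

print(len(documents))                         # 2000 reviews in the standard corpus
print(movie_reviews.categories())             # ['neg', 'pos']
print(documents[0][1], documents[0][0][:10])  # label and first 10 tokens of one shuffled review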
To prepare our data for training, we need to preprocess the text. This involves tokenization, converting to lowercase, and removing stopwords.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(document):
    words = word_tokenize(document)
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    return " ".join(words)

documents = [(preprocess(" ".join(words)), category) for (words, category) in documents]
The code above is explained in detail below:
Lines 1–2: Import necessary modules for stopwords and tokenization.
Line 4: Download NLTK’s stopwords corpus, which contains common words like “the,” “is,” and “and,” typically removed during text processing for tasks like classification or retrieval.
Line 5: Download NLTK's pretrained punkt tokenizer models, which are used to break text into words or sentences.
Line 6: Initialize a set of English stopwords using NLTK. Using a set allows fast membership checks when filtering stopwords, which matters when processing large volumes of text.
Lines 8–11: Define a preprocess function to tokenize, lowercase, and filter stopwords from a given document.
Line 13: Apply the preprocess function to each document in the existing list of tuples (documents). Processed documents are paired with their corresponding categories.
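To see the effect of preprocessing, you can run the preprocess function on a short made-up sentence. Punctuation and stopwords are dropped, and everything is lowercased:

sample = "This movie was not the best, but the acting is absolutely wonderful!"
print(preprocess(sample))
# Expected output (approximately): movie best acting absolutely wonderful

Note that stopword removal also discards negation words such as "not," which is a known trade-off of this simple preprocessing for sentiment tasks.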
Now, we split our dataset into training and testing sets to evaluate the performance of our sentiment classifier.
features = [d for (d, c) in documents]
labels = [c for (d, c) in documents]

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.1, random_state=42)
The code above is explained in detail below:
Line 1: Create a list, features, containing the document elements from the documents list.
Line 2: Create a list, labels, containing the category/label elements from the documents list.
Line 4: Use the train_test_split function to split the features (features) and labels (labels) into training and testing sets.
features_train: The training set of features
features_test: The testing set of features
labels_train: The training set of labels
labels_test: The testing set of labels
test_size=0.1: Specifies that 10% of the data should be used for testing
random_state=42: Sets the random seed for reproducibility
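With the 2,000-review corpus and test_size=0.1, the split leaves 1,800 reviews for training and 200 for testing, which you can confirm with a quick print:

print(len(features_train), len(features_test))  # 1800 200 for the standard corpus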
We use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert the text data into numerical features.
vectorizer = TfidfVectorizer()
features_train_tfidf = vectorizer.fit_transform(features_train)
features_test_tfidf = vectorizer.transform(features_test)
The code above is explained in detail below:
Line 1: Initialize a TF-IDF vectorizer.
Line 2: Apply vectorizer to the training set using the fit_transform function, obtaining features_train_tfidf.
Line 3: Apply the same vectorizer to the testing set using the transform function, obtaining features_test_tfidf.
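If you are curious about the resulting feature space, the transformed matrices are sparse, with one row per review and one column per term learned from the training data. In recent scikit-learn versions, the fitted vectorizer also exposes its vocabulary (the exact size depends on the corpus and preprocessing):

print(features_train_tfidf.shape)                # (training reviews, vocabulary size)
print(features_test_tfidf.shape)                 # (testing reviews, same vocabulary size)
print(len(vectorizer.get_feature_names_out()))   # number of unique terms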
Now, we train an SVM classifier using scikit-learn’s SVC.
from sklearn.svm import SVC

# Convert the sparse matrix to a dense array
features_train_array = features_train_tfidf.toarray()

# Create and train the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(features_train_array, labels_train)

# Convert the sparse matrix to a dense array
features_test_array = features_test_tfidf.toarray()
The code above is explained in detail below:
Line 1: Import the support vector machine (SVM) classifier from scikit-learn.
Line 4: Convert the sparse TF-IDF matrix of training features (features_train_tfidf) into a dense NumPy array (features_train_array).
Line 7: Create an SVM classifier with a linear kernel.
Line 8: Train the SVM classifier on the training data using the dense array of TF-IDF features (features_train_array) and corresponding labels (labels_train).
Line 11: Convert the sparse TF-IDF matrix of testing features (features_test_tfidf) into a dense NumPy array (features_test_array).
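A side note on the dense conversion: SVC also accepts sparse matrices directly, so calling toarray() is optional and mainly increases memory use. If training time becomes an issue on larger corpora, LinearSVC is a common alternative for linear text classification. A minimal sketch, not part of the tutorial's pipeline:

from sklearn.svm import LinearSVC

# Works directly on the sparse TF-IDF matrices; no .toarray() needed
linear_svm = LinearSVC(C=1.0)
linear_svm.fit(features_train_tfidf, labels_train)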
Evaluate the classifier’s performance by calculating its accuracy on the test set.
predictions = svm_classifier.predict(features_test_array)
accuracy = accuracy_score(labels_test, predictions)
print("Accuracy:", accuracy)
The code above is explained in detail below:
Line 1: Use the trained SVM classifier to predict labels for the testing features (features_test_array). Predictions are stored in the predictions variable.
Line 2: Calculate the accuracy of the predictions by comparing them to the true labels (labels_test). The result is stored in the accuracy variable.
Line 3: Print the accuracy of the SVM classifier on the testing set.
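Beyond a single accuracy number, scikit-learn's classification_report and confusion_matrix show per-class precision and recall, which can reveal whether the classifier favors one sentiment over the other:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(labels_test, predictions))
print(confusion_matrix(labels_test, predictions))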
The complete code for all the steps above is consolidated below.
# Step 1: Import libraries
import nltk
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import random

# Step 2: Load the dataset
nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Step 3: Data preprocessing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(document):
    words = word_tokenize(document)
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    return " ".join(words)

documents = [(preprocess(" ".join(words)), category) for (words, category) in documents]

# Step 4: Split the dataset
features = [d for (d, c) in documents]
labels = [c for (d, c) in documents]
features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.1, random_state=42)

# Step 5: Build a feature set
vectorizer = TfidfVectorizer()
features_train_tfidf = vectorizer.fit_transform(features_train)
features_test_tfidf = vectorizer.transform(features_test)

# Step 6: Train the model
from sklearn.svm import SVC

# Convert the sparse matrix to a dense array
features_train_array = features_train_tfidf.toarray()

# Create and train the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(features_train_array, labels_train)

# Convert the sparse matrix to a dense array
features_test_array = features_test_tfidf.toarray()

# Step 7: Evaluate the model
predictions = svm_classifier.predict(features_test_array)
accuracy = accuracy_score(labels_test, predictions)
print("Accuracy:", accuracy)
Note: If you aim for the best accuracy, consider exploring model tuning techniques, experimenting with different algorithms, or adjusting hyperparameters. Remember that achieving a high overall accuracy does not guarantee accuracy for every prediction. Model refinement is an iterative process, and fine-tuning may be necessary for specific use cases or text data types.
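As one example of the hyperparameter tuning mentioned above, a grid search over the SVM's C parameter and kernel can be run with scikit-learn's GridSearchCV. The parameter grid below is only a starting point, not a recommended setting:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Cross-validated search over a small, illustrative grid of SVM hyperparameters
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(features_train_tfidf, labels_train)  # SVC accepts the sparse TF-IDF input

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)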