Sentiment analysis of movie reviews using NLTK

Sentiment analysis is a natural language processing (NLP) technique that determines the sentiment expressed in a text. This Answer explores how to perform sentiment analysis on NLTK's movie reviews corpus. We will use the Natural Language Toolkit (NLTK) and scikit-learn to build a sentiment classifier based on support vector machines (SVMs).

Support vector machines (SVMs) are a popular class of supervised learning algorithms for classification and regression tasks. In sentiment analysis, SVMs are often chosen for their effectiveness in handling high-dimensional data, such as text data represented as TF-IDF vectors.

Here’s a brief explanation of why SVMs are suitable for sentiment analysis (a small code example follows the list):

  • Effective in high-dimensional spaces: In sentiment analysis, each document (e.g., movie review) is represented as a high-dimensional vector, where each dimension corresponds to a unique word or term. SVMs are effective in high-dimensional spaces, making them well-suited for tasks like text classification.

  • Effective for linear and nonlinear separation: SVMs aim to find the hyperplane that best separates the data points belonging to different classes (positive and negative sentiments, in this case). With an appropriate kernel function, SVMs can handle both linearly and nonlinearly separable data.

  • Robustness to overfitting: SVMs have regularization parameters that help control overfitting, making them robust to noisy or complex datasets.

  • Global optimal solution: The SVM training objective is convex, so the solver finds a globally optimal hyperplane rather than getting stuck in a local optimum. Maximizing the margin between classes also leads to better generalization and robustness of the model.

  • Effective with fewer training examples: SVMs perform well even when the number of training examples is smaller than the number of features, which is common in text data, where the vocabulary size can be large.
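
To make these points concrete, here is a minimal, self-contained sketch that fits a linear-kernel SVM on TF-IDF features. The four toy reviews and their labels are invented for illustration; real corpora produce far higher-dimensional vectors:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Four toy reviews, invented for illustration
texts = ["a wonderful, moving film",
         "brilliant acting and a great story",
         "a dull, boring mess",
         "terrible plot and awful pacing"]
labels = ["pos", "pos", "neg", "neg"]

# Each review becomes one row of a high-dimensional, sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(texts)

# A linear kernel looks for the maximum-margin hyperplane between the classes
clf = SVC(kernel='linear')
clf.fit(vectors, labels)
print(clf.predict(vectorizer.transform(["a great, moving story"])))  # expected: ['pos']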

Problem statement

We aim to build a sentiment analysis model to classify movie reviews into positive or negative sentiments. We will use a dataset of movie reviews labeled as positive or negative to train and evaluate our sentiment classifier.

Steps to solve the above problem

Step 1: Import libraries

Let’s start by importing the required libraries for our sentiment analysis project. We’ll use NLTK for natural language processing tasks and scikit-learn for machine learning tasks.

import nltk
from nltk.corpus import movie_reviews
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import random

Step 2: Load the dataset

Next, we need to load the Movie Review dataset provided by NLTK. This dataset contains reviews labeled as positive or negative.

nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

The code above is explained in detail below:

  • Line 1: Download the 'movie_reviews' dataset using NLTK.

  • Lines 2–4: Create a list of tuples, each containing a movie review’s word list and its category (positive or negative).

  • Line 5: Shuffle the list of tuples to randomize the order of reviews for later splitting into training and testing sets.
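
As a quick sanity check, you can inspect what was loaded; the movie_reviews corpus contains 2,000 reviews, evenly split between the two categories:

print(len(documents))              # 2000
print(movie_reviews.categories())  # ['neg', 'pos']
words, label = documents[0]
print(label, words[:10])           # category and first ten tokens of one review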

Step 3: Data preprocessing

To prepare our data for training, we need to preprocess the text. This involves tokenization, converting to lowercase, and removing stopwords.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(document):
    words = word_tokenize(document)
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    return " ".join(words)

documents = [(preprocess(" ".join(words)), category) for (words, category) in documents]

The code above is explained in detail below:

  • Lines 1–2: Import necessary modules for stopwords and tokenization.

  • Line 4: Download NLTK’s stopwords corpus, which contains common words like “the,” “is,” and “and,” typically removed during text processing for tasks like classification or retrieval.

  • Line 5: Retrieve NLTK’s pretrained punkt tokenizer models, used for breaking text into words or sentences, facilitating tasks such as tokenization.

  • Line 6: Initialize a set of English stopwords using NLTK. Storing the stopwords in a set enables fast membership checks, which matters when filtering large volumes of text.

  • Lines 8–11: Define a preprocess function to tokenize, lowercase, and filter stopwords from a given document.

  • Line 13: Apply the preprocess function to each document in the existing list of tuples (documents). Processed documents are paired with their corresponding categories.
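
For instance, applying preprocess to a short sentence (invented here) strips punctuation, stopwords, and case; the exact output depends on NLTK's stopword list:

sample = "The movie was NOT good, but the acting was great!"
print(preprocess(sample))  # e.g., "movie good acting great"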

Step 4: Split the dataset

Now, we split our dataset into training and testing sets to evaluate the performance of our sentiment classifier.

features = [d for (d, c) in documents]
labels = [c for (d, c) in documents]

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.1, random_state=42)

The code above is explained in detail below:

  • Line 1: Create a list, features, containing the document elements from the documents list.

  • Line 2: Create a list, labels, containing the category/label elements from the documents list.

  • Line 4: Use the train_test_split function to split the features (features) and labels (labels) into training and testing sets.

    • features_train: The training set of features

    • features_test: The testing set of features

    • labels_train: The training set of labels

    • labels_test: The testing set of labels

    • test_size=0.1: Specifies that 10% of the data should be used for testing

    • random_state=42: Sets the random seed for reproducibility
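
A quick check of the split sizes; with the 2,000-review corpus and test_size=0.1, this should print 1800 and 200:

print(len(features_train), len(features_test))  # expected: 1800 200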

Step 5: Build a feature set

We use the TF-IDF (term frequency-inverse document frequency) vectorizer to convert the text data into numerical features. TF-IDF weights each term by how often it appears in a document, discounted by how common the term is across all documents, so frequent but uninformative words receive low weights.

vectorizer = TfidfVectorizer()
features_train_tfidf = vectorizer.fit_transform(features_train)
features_test_tfidf = vectorizer.transform(features_test)

The code above is explained in detail below:

  • Line 1: Initialize a TF-IDF vectorizer.

  • Line 2: Fit the vectorizer on the training set and transform it using the fit_transform method, obtaining features_train_tfidf. Fitting learns the vocabulary and IDF weights from the training data only.

  • Line 3: Apply the already fitted vectorizer to the testing set using the transform method, obtaining features_test_tfidf. Using transform (not fit_transform) here prevents information from the test set from leaking into the features.
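
Both results are sparse matrices. Inspecting their shapes shows the dimensionality the SVM will work with; the exact vocabulary size depends on the corpus and on the preprocessing in Step 3:

print(features_train_tfidf.shape)  # (number of training reviews, vocabulary size)
print(features_test_tfidf.shape)   # same vocabulary size, fewer rows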

Step 6: Train the model

Now, we train an SVM classifier using scikit-learn’s SVC.

from sklearn.svm import SVC

# Convert the sparse matrix to a dense array
features_train_array = features_train_tfidf.toarray()

# Create and train the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(features_train_array, labels_train)

# Convert the sparse matrix to a dense array
features_test_array = features_test_tfidf.toarray()

The code above is explained in detail below:

  • Line 1: Import the support vector machine (SVM) classifier from scikit-learn.

  • Line 4: Convert the sparse TF-IDF matrix of training features (features_train_tfidf) into a dense NumPy array (features_train_array).

  • Line 7: Create an SVM classifier with a linear kernel.

  • Line 8: Train the SVM classifier on the training data using the dense array of TF-IDF features (features_train_array) and corresponding labels (labels_train).

  • Line 11: Convert the sparse TF-IDF matrix of testing features (features_test_tfidf) into a dense NumPy array (features_test_array).
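
A note on the dense conversion: scikit-learn's SVC also accepts sparse matrices directly, so the toarray() calls are optional; skipping them saves memory on larger corpora. A minimal sketch of the sparse variant:

# Equivalent training without densifying the TF-IDF matrices
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(features_train_tfidf, labels_train)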

Step 7: Evaluate the model

Evaluate the classifier’s performance by calculating its accuracy on the test set.

predictions = svm_classifier.predict(features_test_array)
accuracy = accuracy_score(labels_test, predictions)
print("Accuracy:", accuracy)

The code above is explained in detail below:

  • Line 1: Use the trained SVM classifier to predict labels for the testing features (features_test_array). Predictions are stored in the predictions variable.

  • Line 2: Calculate the accuracy of the predictions by comparing them to the true labels (labels_test). The result is stored in the accuracy variable.

  • Line 3: Print the accuracy of the SVM classifier on the testing set.
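
Accuracy is a single number; for a fuller picture, scikit-learn's classification_report and confusion_matrix show per-class precision, recall, and F1:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(labels_test, predictions))
print(confusion_matrix(labels_test, predictions))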

Try it yourself

The complete script from all the steps above is collected below so you can run it end to end, for example in a Jupyter Notebook.

# Step 1: Import libraries
import nltk
from nltk.corpus import movie_reviews
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import random

# Step 2: Load the dataset
nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Step 3: Data preprocessing
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(document):
    words = word_tokenize(document)
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    return " ".join(words)

documents = [(preprocess(" ".join(words)), category) for (words, category) in documents]

# Step 4: Split the dataset
features = [d for (d, c) in documents]
labels = [c for (d, c) in documents]

features_train, features_test, labels_train, labels_test = train_test_split(features, labels, test_size=0.1, random_state=42)

# Step 5: Build a feature set
vectorizer = TfidfVectorizer()
features_train_tfidf = vectorizer.fit_transform(features_train)
features_test_tfidf = vectorizer.transform(features_test)

# Step 6: Train the model
from sklearn.svm import SVC

# Convert the sparse matrix to a dense array
features_train_array = features_train_tfidf.toarray()

# Create and train the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(features_train_array, labels_train)

# Convert the sparse matrix to a dense array
features_test_array = features_test_tfidf.toarray()

# Step 7: Evaluate the model
predictions = svm_classifier.predict(features_test_array)
accuracy = accuracy_score(labels_test, predictions)
print("Accuracy:", accuracy)

Note: If you aim for the best accuracy, consider exploring model tuning techniques, experimenting with different algorithms, or adjusting hyperparameters. Remember that achieving a high overall accuracy does not guarantee accuracy for every prediction. Model refinement is an iterative process, and fine-tuning may be necessary for specific use cases or text data types.


Copyright ©2024 Educative, Inc. All rights reserved