Sentiment Analysis Using Multinomial Logistic Regression

Multinomial logistic regression is a statistical method that models the relationship between a single categorical dependent variable with three or more categories and a set of independent variables. It extends binary logistic regression, which handles only two categories. The model estimates a separate set of coefficients for each category, forms a linear combination of the independent variables with those coefficients, and passes the resulting scores through the softmax function to convert them into probabilities that sum to one. Each observation is then assigned to the category with the highest probability. Multinomial logistic regression is used in many fields, both to predict categorical outcomes and to understand which factors influence them.
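The softmax step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the function names (softmax, predict_proba, predict) and the layout of weights and biases as per-class lists are assumptions made for the example.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities."""
    # Subtracting the maximum score avoids overflow in exp()
    # and does not change the resulting probabilities.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, weights, biases):
    """Class probabilities for one feature vector x.

    weights[k] holds the coefficients for class k and biases[k] its
    intercept (a hypothetical layout chosen for this sketch).
    """
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)

def predict(x, weights, biases):
    """Index of the class with the highest probability."""
    probs = predict_proba(x, weights, biases)
    return max(range(len(probs)), key=probs.__getitem__)
```

For example, predict([2.0, 0.5], [[1, 0], [0, 1], [-1, -1]], [0, 0, 0]) scores the three classes as 2.0, 0.5, and -2.5, so the first class (index 0) receives the highest probability.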

In this project, we'll build a multiclass sentiment classifier from scratch using multinomial logistic regression and the Twitter Tweets Sentiment Dataset. We'll preprocess the tweets by removing punctuation, build a vocabulary from the most frequent words in the dataset, and convert each tweet into a bag-of-words feature vector with the CountVectorizer class from the scikit-learn library. Next, we'll split the dataset into training and testing subsets with stratified sampling and implement the multinomial logistic regression classifier. Finally, we'll train the model on the training subset, evaluate it on the testing subset, and report evaluation metrics, a confusion matrix, and a classification report.
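The workflow above can be sketched end to end as follows. This is a rough outline under stated assumptions, not the project's final code: the twelve toy tweets and their labels are made-up placeholders standing in for the real dataset, the helper names (remove_punctuation, train_softmax_regression, predict) are hypothetical, and the vocabulary size, learning rate, and epoch count are arbitrary illustrative values. The classifier is trained with plain batch gradient descent on the cross-entropy loss.

```python
import string
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

def remove_punctuation(text):
    """Strip ASCII punctuation from a tweet."""
    return text.translate(str.maketrans("", "", string.punctuation))

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, n_classes, lr=0.5, epochs=300):
    """Fit weights W and biases b by batch gradient descent."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]              # one-hot encoded labels
    for _ in range(epochs):
        P = softmax(X @ W + b)            # predicted probabilities
        grad = (P - Y) / n_samples        # gradient of mean cross-entropy w.r.t. scores
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Assign each row to the class with the highest probability."""
    return softmax(X @ W + b).argmax(axis=1)

# Placeholder data (assumption): the real project loads the tweets dataset instead.
texts = [
    "I love this, great day!", "Great news, love it!!", "Love the great weather.",
    "What a great song, love it", "I hate this, awful day.", "Awful service, hate it!",
    "Hate the awful traffic...", "Such an awful movie, hate it", "The meeting is at noon.",
    "Report due at noon today", "Meeting moved to today.", "Today the report is due",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Preprocess, then build a bag-of-words matrix from the most frequent words.
cleaned = [remove_punctuation(t) for t in texts]
vectorizer = CountVectorizer(max_features=50)   # keeps the 50 most frequent terms
X = vectorizer.fit_transform(cleaned).toarray().astype(float)

classes = sorted(set(labels))
y = np.array([classes.index(label) for label in labels])

# Stratified split keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

W, b = train_softmax_regression(X_train, y_train, n_classes=len(classes))
y_pred = predict(X_test, W, b)

class_ids = list(range(len(classes)))
cm = confusion_matrix(y_test, y_pred, labels=class_ids)
print(cm)
print(classification_report(y_test, y_pred, labels=class_ids,
                            target_names=classes, zero_division=0))
```

With a dataset this small the reported metrics mean nothing; the sketch only shows how the steps fit together. On the real dataset, the vocabulary size and training hyperparameters would need tuning.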