
Machine Learning and Imbalanced Data

Explore methods to handle imbalanced datasets in multiclass classification using logistic regression. Learn to identify accuracy pitfalls, apply oversampling, and use SMOTE to create synthetic samples, improving model recall and generalization on minority classes.

Since we have the features and the targets from our previous lesson, let's split them into train and test datasets.

Imbalanced data

Let's also check the class imbalance for our training data.

Python 3.8
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(
    y_train.value_counts(), '\n',
    "Minority class (Active) is only {}% of the training set".format(
        round(y_train.value_counts()[1] / len(y_train) * 100, 2)
    )
)
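As a side note, an unstratified split can distort an already rare class. Here is a minimal sketch on synthetic data (not the lesson's dataset) showing how `stratify=y` keeps the minority share roughly equal in both splits:

```python
# Sketch on synthetic data: the array shapes and the ~1.5% minority share
# are illustrative assumptions, chosen to mirror the lesson's imbalance.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 15 + [0] * 985)  # ~1.5% minority class

# stratify=y preserves the class ratio in both train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)
print("Train minority share:", round(y_tr.mean() * 100, 2), "%")
print("Test minority share:", round(y_te.mean() * 100, 2), "%")
```

Both printed shares stay close to the overall 1.5%, whereas a random split of so few minority samples can drift noticeably.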

With that, let's train a logistic regression model.

Python 3.8
from sklearn.linear_model import LogisticRegression

# Create a model instance; max_iter is raised so the solver converges
logR = LogisticRegression(max_iter=10000)

# Fit the model
logR.fit(X_train, y_train)

# Accuracy score on the training set
print("Accuracy Score for (X_train, y_train):", logR.score(X_train, y_train))

The numbers look impressive, with an accuracy of ~98%. But the minority class makes up only 1.5% of the data, making the baseline ...
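To see why accuracy is misleading here, consider a minimal sketch on synthetic data (the class ratio is an assumption mirroring the lesson's 1.5%): a trivial classifier that always predicts the majority class already reaches ~98.5% accuracy while recalling none of the minority class.

```python
# Sketch on synthetic data: shapes and the 15/985 class split are
# illustrative assumptions, not the lesson's actual dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 15 + [0] * 985)  # ~1.5% minority class

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)

print("Baseline accuracy:", dummy.score(X, y))                 # 0.985
print("Minority recall:", recall_score(y, dummy.predict(X)))   # 0.0
```

A high accuracy can therefore hide a model that never detects the minority class, which is why metrics like recall matter for imbalanced problems.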