Dealing with Imbalanced Datasets in Python Programming

Learn the fundamentals of imbalanced datasets and explore how to use SMOTE to handle them effectively.

In this lesson, we will learn to rectify an imbalanced dataset using the MNIST dataset, focusing on the digits 0 and 1. We will investigate what happens when one class has more examples than the other and how this affects the model’s performance. Then, we will apply SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset. Next, we will train a CNN on both the imbalanced and the balanced datasets. Finally, we will compare the performance of the two models using metrics such as accuracy, F1 score, precision, and recall.
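To make the idea behind SMOTE concrete, here is a minimal sketch of its core step: a synthetic minority sample is created by interpolating between an existing minority sample and one of its nearest minority-class neighbors. The function name `smote_sketch`, its parameters, and the toy points are hypothetical and chosen only for illustration. In practice, a library implementation such as the `SMOTE` class in imbalanced-learn is commonly used; whether this lesson relies on it is an assumption.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class samples.

    Illustrative only: each new point lies on the line segment between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate with a random gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D minority class (made-up points, not MNIST data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

For images, the same interpolation would be applied to flattened pixel vectors, with the result reshaped back to the image dimensions before training.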

This lesson is divided into the following three steps:

  • Step 1: Using a bar chart, we will visualize how many images of the digits 0 and 1 are in the MNIST dataset.

  • Step 2: We will apply SMOTE to balance the imbalanced dataset and create a bar chart to show the updated distribution.

  • Step 3: We will train a CNN model on both the imbalanced and the balanced datasets and measure its performance on each using metrics such as accuracy, F1 score, precision, and recall.
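The metrics named in Step 3 can all be derived from the counts of true and false positives and negatives. The sketch below is a minimal illustration, assuming a binary task in which the digit 1 is treated as the positive class. The helper `binary_metrics` and the toy labels are hypothetical. The example shows why accuracy alone can mislead on imbalanced data: a model that always predicts the majority class scores high accuracy while its recall for the minority class is zero.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 90 zeros, 10 ones, and a model that always predicts 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred, positive=1)
print(metrics)
```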

Step 1: Visualizing the MNIST dataset (digits 0 and 1)

The code provided below generates a bar chart that displays the imbalanced distribution of the digits 0 and 1 in the dataset.
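A minimal sketch of such a chart might look like the following. It assumes the labels are already available as a sequence of integers (for example, the label arrays returned by `keras.datasets.mnist.load_data()`) and that matplotlib is installed. The helper names `count_digits` and `plot_digit_counts` are hypothetical, and the placeholder labels are made up so that the snippet runs without downloading MNIST.

```python
from collections import Counter


def count_digits(labels, digits=(0, 1)):
    """Count how many labels belong to each of the requested digits."""
    counts = Counter(int(label) for label in labels)
    return {d: counts.get(d, 0) for d in digits}


def plot_digit_counts(counts):
    """Draw a bar chart of the class distribution (requires matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    plt.bar([str(d) for d in counts], list(counts.values()))
    plt.xlabel("Digit")
    plt.ylabel("Number of images")
    plt.title("Class distribution of digits 0 and 1")
    plt.show()


# Placeholder labels standing in for the real MNIST labels (assumption).
labels = [0] * 30 + [1] * 5 + [7] * 10
counts = count_digits(labels)
print(counts)
# Call plot_digit_counts(counts) to display the chart.
```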

