Simulating Biased Mislabeling Using Python Programming

The primary focus of this lesson is to simulate noise in a dataset and demonstrate its impact through visualization. It offers hands-on experience simulating biased mislabeling in the MNIST digit dataset using Python. The lesson is divided into the following two steps:

  • Step 1: We will simulate biased mislabeling by changing labels according to predefined biases or assumptions. We’ll learn to mislabel images based on features that different classes share, working through the provided code examples and instructions.

  • Step 2: We will visualize the dataset after simulating biased mislabeling. We’ll also generate a bar chart to observe the distribution of mislabeled images across different digits. This visualization will help us understand the extent of the mislabeling and how it affects the label distribution of the dataset.

Step 1: Simulating biased mislabeling in the MNIST dataset

Biased mislabeling refers to the part of the data that is labeled incorrectly because of our own ideas, assumptions, or understanding, rather than purely at random. For example, in a dataset of handwritten digits, 9 has a similar structure to 0 and 8. As a result, a 9 may be mislabeled as 0 or 8. This creates biased mislabeling in a dataset. One reason for such biased mislabeling is low-quality data.
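One way to picture this bias in code is a lookup from each digit to the digits it resembles. The sketch below is illustrative only: the name `SIMILAR_DIGITS` and the helper `biased_wrong_label` are hypothetical, and only the 9 → 0 or 8 entry comes from the example above. The other entries are assumptions added to show the idea.

```python
import random

# Hypothetical look-alike map. Only the entry for 9 comes from the lesson text;
# the entries for 1 and 3 are illustrative assumptions.
SIMILAR_DIGITS = {
    9: [0, 8],
    1: [7],
    3: [8],
}


def biased_wrong_label(true_label, rng):
    """Return a label of a digit that resembles true_label.

    Digits without an entry in SIMILAR_DIGITS keep their original label.
    """
    candidates = SIMILAR_DIGITS.get(true_label, [])
    if not candidates:
        return true_label
    return rng.choice(candidates)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
# Each draw for a 9 is either 0 or 8, never an unrelated digit such as 4.
print([biased_wrong_label(9, rng) for _ in range(5)])
```

Unlike uniform random noise, the wrong label here is always drawn from a small, class-specific set, which is what makes the mislabeling biased.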

Simulating 10% biased mislabeling in each class of the MNIST dataset

To simulate 10% biased mislabeling in the MNIST dataset, we randomly select 10% of the training images in each class and replace their labels with the label of a structurally similar digit. The code below demonstrates how to introduce biased mislabeling to the MNIST digit dataset using Python.
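A minimal sketch of this procedure follows, under some stated assumptions. It works on a plain Python list of integer labels so that it runs without downloading MNIST; with the real dataset, we would pass the training labels instead (for example, a list built from a `y_train` array, which is an assumed name). The look-alike map and the function name `simulate_biased_mislabeling` are hypothetical, and apart from 9 → 0 or 8, the digit pairings are illustrative guesses rather than the lesson's own choices.

```python
import random
from collections import Counter

# Assumed look-alike map: every digit maps to digits it could be confused with.
# Only 9 -> [0, 8] is taken from the lesson; the rest are illustrative.
SIMILAR_DIGITS = {
    0: [6, 8], 1: [7], 2: [7], 3: [8], 4: [9],
    5: [6], 6: [0, 5], 7: [1], 8: [0, 3], 9: [0, 8],
}


def simulate_biased_mislabeling(labels, fraction=0.10, seed=0):
    """Flip `fraction` of the labels in each class to a look-alike digit.

    Returns the noisy labels and the sorted indices of the flipped examples.
    """
    rng = random.Random(seed)
    noisy = list(labels)  # copy so the clean labels stay untouched
    flipped = []
    for digit, lookalikes in SIMILAR_DIGITS.items():
        # Indices are collected from the clean labels, so earlier flips
        # never change which examples belong to a class.
        class_indices = [i for i, y in enumerate(labels) if y == digit]
        n_flip = int(round(fraction * len(class_indices)))
        for i in rng.sample(class_indices, n_flip):
            noisy[i] = rng.choice(lookalikes)
            flipped.append(i)
    return noisy, sorted(flipped)


# Stand-in for the MNIST training labels: 100 examples of each digit.
labels = [digit for digit in range(10) for _ in range(100)]
noisy_labels, flipped = simulate_biased_mislabeling(labels)

# Count how many examples of each true digit were mislabeled.
per_digit = Counter(labels[i] for i in flipped)
print(len(flipped), dict(per_digit))
```

Because no digit appears in its own look-alike list, every selected example ends up with a label that differs from its true one, so exactly 10% of each class is mislabeled. The `per_digit` counts are the kind of summary a bar chart of mislabeled images per digit would display.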

