Noisy Data and Label Noise

Explore the concept of noisy data and label noise in machine learning. Understand different sources of noise, reasons for mislabeling, and types of mislabeling, including unbiased and biased. This lesson helps you recognize how noise affects model accuracy and prepares you to manage label errors effectively.

What is noise?

Noise is undesirable behavior within data. Any data that a machine cannot easily understand or correctly interpret is also considered noise. In a dataset, noise can take various forms, including outliers, measurement errors, missing values, and labeling errors. It can distort the statistical properties of the data, introduce inaccuracies, and degrade both analysis and the training of ML models.

Unreliable data collection tools are a common source of noise in datasets. Errors introduced by faulty equipment can substantially reduce the accuracy of ML models.

We cannot eliminate noise entirely while collecting and processing data, but we can reduce it through data cleansing and transformation.
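As a rough illustration of data cleansing, the sketch below drops missing values and separates outliers from a list of readings. The function name `clean_readings`, the sample values, and the choice of a modified z-score based on the median absolute deviation with a 3.5 cutoff are all assumptions made for this example; other cleansing rules may suit a given dataset better.

```python
import statistics

def clean_readings(readings, threshold=3.5):
    """Drop missing values (None) and split the rest into kept values and outliers.

    Outliers are flagged with a modified z-score built on the median absolute
    deviation (MAD). The 3.5 threshold is an assumed, commonly used default.
    """
    present = [x for x in readings if x is not None]
    if not present:
        return [], []
    med = statistics.median(present)
    mad = statistics.median([abs(x - med) for x in present])
    if mad == 0:
        # No spread around the median, so nothing can be scored as an outlier.
        return present, []
    kept, outliers = [], []
    for x in present:
        score = 0.6745 * abs(x - med) / mad
        (outliers if score > threshold else kept).append(x)
    return kept, outliers

# Hypothetical sensor readings with one missing value and one implausible spike.
readings = [10.1, 9.9, None, 10.0, 10.2, 55.0, 9.8]
kept, outliers = clean_readings(readings)
```

Here `kept` holds the plausible readings and `outliers` holds the flagged spike. Flagged values are returned rather than silently discarded so that they can be reviewed.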

Noise sources

The three main causes of noise are as follows:

Sources of noise
  • Implicit errors: This type of error is caused by faulty measurement tools. Contributing factors include calibration issues, sensor-specific measurement errors, and inaccuracies in the measurement process.

  • Random error: This type of error is an unpredictable difference between the true and the observed value. Random errors are usually unavoidable and can occur due to natural variability, environmental factors, or limitations in the measurement process.

  • Human error: This type of error refers to mistakes made by people while performing a task or carrying out a process. It happens when individuals lack knowledge, get distracted, become tired, or make poor decisions. Human errors can have a major impact on the outcome of a task, and they are a common source of mistakes in many different areas.
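The difference between the first two error types can be illustrated with a small simulation. This is only a sketch under assumed values: the function name `simulate_readings`, the true value of 20.0, the 0.5 offset, and the Gaussian noise model are hypothetical choices, not properties of any particular instrument. A miscalibrated tool shifts every reading by the same amount, so averaging does not remove the error, while random error tends to average out over many readings.

```python
import random
import statistics

def simulate_readings(true_value, n, offset=0.0, noise_sd=0.0, seed=0):
    """Return n simulated readings of true_value.

    offset   models a tool-driven error (e.g., a miscalibrated sensor).
    noise_sd models random error as zero-mean Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    return [true_value + offset + rng.gauss(0.0, noise_sd) for _ in range(n)]

TRUE_VALUE = 20.0

# Tool-driven error only: every reading is shifted by the same amount.
tool_error = simulate_readings(TRUE_VALUE, 1000, offset=0.5)

# Random error only: readings scatter around the true value.
random_error = simulate_readings(TRUE_VALUE, 1000, noise_sd=0.5)

bias_tool = statistics.mean(tool_error) - TRUE_VALUE      # stays near the offset
bias_random = statistics.mean(random_error) - TRUE_VALUE  # shrinks toward zero
```

Comparing `bias_tool` with `bias_random` shows why collecting more data helps with random error but not with a faulty tool, which has to be recalibrated or corrected.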

Label noise in ML data

Label noise, also known as mislabeling, is introduced when examples or instances in a dataset are given incorrect labels that do not match their actual class or category.

The following figure demonstrates how data can be mislabeled.

Demonstration of how data can be mislabeled

In the above example, the digits 2 and 6 are mislabeled as “Label 1” and “Label 0,” respectively. In reality, their correct labels are “Label 2” and “Label 6.”

Reasons for mislabeling

Common causes of mislabeling

Subjectivity

Subjectivity occurs when different people perceive or understand the same data differently. For example, the perception of affordability can be subjective. A product may be labeled and considered affordable by one person based on their income level and financial situation, but another individual with a different financial background may consider that same product to be too expensive. Therefore, we say that the affordability of a product is subjective because it can vary depending on personal preferences, economic factors, and individual circumstances.

Data-entry error

One significant factor leading to mislabeling in datasets is the occurrence of errors during the data entry process. When information is manually entered into a system or database, mistakes can occur for various reasons, including human fallibility, typos, or incorrect interpretations.

Lack of information

Mislabeling also occurs when the information available for labeling an instance is insufficient. For example, accurately labeling diseases in medicine requires a comprehensive understanding of the relevant information. Without enough relevant data, it becomes difficult to assign the correct label.

Types of mislabeling

Types of mislabeling

Unbiased mislabeling

Unbiased mislabeling occurs when data is labeled incorrectly with no systematic preference for any particular class. If we have multiple classes, a mislabeled instance has an equal chance of receiving any of the other classes.
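To make this concrete, unbiased mislabeling can be simulated by flipping a fraction of labels to a uniformly chosen different class. The sketch below is a minimal illustration; the function name `add_uniform_label_noise`, the `noise_rate` parameter, and the ten-class toy labels are assumptions for the example.

```python
import random

def add_uniform_label_noise(labels, classes, noise_rate, seed=0):
    """Flip each label with probability noise_rate to a different class.

    The replacement class is drawn uniformly from all other classes,
    so no class is favored.
    """
    rng = random.Random(seed)
    classes = list(classes)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others = [c for c in classes if c != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy

# Toy labels for a ten-class problem (digits 0 to 9).
clean = [i % 10 for i in range(10000)]
noisy = add_uniform_label_noise(clean, range(10), noise_rate=0.2)

# Fraction of labels that actually changed; expected to be close to noise_rate.
flipped = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```

Because the replacement class is chosen uniformly, the wrong labels are spread roughly evenly over all other classes.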

Biased mislabeling

In biased mislabeling, the data is mistakenly labeled with classes that have similar features. This type of mislabeling frequently occurs in datasets where the instances of a class are difficult to recognize. Biased mislabeling creates consistent noise because a class is repeatedly mislabeled as the same few similar classes. By contrast, in unbiased mislabeling, the noise is inconsistent because every class has an equal chance of being mislabeled as any other class.

For instance, if we use the MNIST digit dataset and choose the digit 9, it’s possible to mislabel 9 as 0 or 8 because the structure of the digit 9 closely resembles that of the digits 0 and 8.

Demonstration of biased mislabeling using the MNIST handwritten digit dataset
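Biased mislabeling can be simulated in a similar spirit by restricting the wrong label to a set of look-alike classes. The sketch below uses the 9 to 0 or 8 confusion from the example above; the function name `add_biased_label_noise`, the `confusable` mapping, and the 0.3 noise rate are hypothetical values chosen for illustration, and no real MNIST data is loaded.

```python
import random

def add_biased_label_noise(labels, confusable, noise_rate, seed=0):
    """Flip labels only toward classes listed as similar.

    confusable maps a true class to the classes it tends to be mistaken for.
    Classes without an entry are never flipped.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        targets = confusable.get(y, [])
        if targets and rng.random() < noise_rate:
            noisy.append(rng.choice(targets))
        else:
            noisy.append(y)
    return noisy

# Assumed confusion pattern: 9 is mistaken for 0 or 8, other digits are not.
confusable = {9: [0, 8]}

clean = [9] * 5000 + [3] * 5000
noisy = add_biased_label_noise(clean, confusable, noise_rate=0.3)

# Labels that the true 9s ended up with: only 9, 0, or 8 can appear.
labels_for_nines = set(noisy[:5000])
```

Unlike the uniform case, the errors here pile up in a few specific class pairs, which is what makes the noise consistent.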

Summary

Noise is unwanted or irrelevant information in data that degrades an ML model’s performance. Label noise, also known as mislabeling, means that some examples or instances in a dataset are incorrectly labeled. There are two types of mislabeling: unbiased and biased. In unbiased mislabeling, there’s an equal chance that an instance is mislabeled as any other class. In contrast, in biased mislabeling, an instance is mislabeled as a class that has similar features or is closely related to it.

Mislabeling in ML is a serious issue that requires careful consideration and attention. We can minimize the impact of mislabeling and also improve the performance of ML models by using effective strategies to perform data quality checks.
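One possible data quality check is to compare dataset labels against a small subset that has been carefully re-labeled. The sketch below assumes such a trusted subset exists; the function name `audit_labels`, the example IDs, and the labels are all made up for illustration.

```python
from collections import Counter

def audit_labels(dataset_labels, trusted_labels):
    """Compare dataset labels with trusted labels on the examples they share.

    Both arguments map an example id to a label. Returns the estimated
    noise rate and a Counter of (trusted label, dataset label) disagreements.
    """
    shared = dataset_labels.keys() & trusted_labels.keys()
    if not shared:
        raise ValueError("no overlapping example ids to audit")
    confusions = Counter(
        (trusted_labels[i], dataset_labels[i])
        for i in shared
        if trusted_labels[i] != dataset_labels[i]
    )
    noise_rate = sum(confusions.values()) / len(shared)
    return noise_rate, confusions

# Hypothetical audit: five examples re-labeled by a careful reviewer.
dataset_labels = {"img1": 9, "img2": 8, "img3": 0, "img4": 3, "img5": 8}
trusted_labels = {"img1": 9, "img2": 9, "img3": 9, "img4": 3, "img5": 9}

noise_rate, confusions = audit_labels(dataset_labels, trusted_labels)
```

If the disagreements concentrate in a few class pairs, as they do here for 9 versus 0 and 8, the noise is likely biased. If they are spread evenly across classes, it looks closer to unbiased mislabeling.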