Data Scrubbing Operation: Drop Missing Values

This lesson covers ways of removing missing values from a dataset.

Quick overview: Another common but more complicated problem is deciding what to do with missing data. Missing data can be split into three categories:

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Nonignorable

Missing completely at random (MCAR) occurs when there’s no relationship between a missing value and other values in the dataset. Oftentimes, the value is not readily available and is therefore left out of the dataset.

Missing at random (MAR) means the missing value is not related to its own value but is instead related to the values of other variables. In census surveys, for example, a respondent might skip an extended-response question because the relevant information was entered in a previous question, or might fail to complete the survey because of low language proficiency, as reported by the respondent elsewhere in the survey.

In other words, the reason why the value is missing is linked to another variable in the dataset and not due directly to the value itself.

Lastly, nonignorable missing data is data that is absent directly because of its own value or the significance of the information. For example, tax-evading citizens or respondents with a criminal record may decline to answer certain questions because they consider those questions sensitive. The irony of these three categories is that it’s difficult to diagnose why the data is missing precisely because the data is missing.
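The three categories can be made concrete with a small simulation. The following is a minimal sketch, not a definitive model: the `survey` DataFrame, its column names, the 20% drop rate, and the 70,000 income threshold are all hypothetical, and the MAR and nonignorable rules are written as hard cutoffs for readability, whereas real missingness is usually probabilistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
n = 1000

# Hypothetical survey data: language proficiency and true annual income.
survey = pd.DataFrame({
    "proficiency": rng.choice(["low", "high"], size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its value,
# regardless of any column in the dataset.
mcar_mask = rng.random(n) < 0.2
survey["income_mcar"] = survey["income"].mask(mcar_mask)

# MAR: missingness depends on another observed column (proficiency),
# not on the income value itself.
mar_mask = survey["proficiency"] == "low"
survey["income_mar"] = survey["income"].mask(mar_mask)

# Nonignorable: missingness depends on the hidden value itself
# (here, high earners decline to answer).
nonignorable_mask = survey["income"] > 70_000
survey["income_nonignorable"] = survey["income"].mask(nonignorable_mask)

# Count the missing values produced by each mechanism.
print(survey.isnull().sum())
```

In the MAR column, the gaps can be explained by looking at `proficiency`, which is still fully observed. In the nonignorable column, the explanation lives in the very values that were removed, which is why this category is the hardest to diagnose from the data alone.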

Problem-solving skills and awareness of these three categories can help diagnose and correct the root cause of missing values. This might include rewording surveys for second-language speakers to address data missing at random, or redesigning data collection methods, such as observing sensitive information rather than asking participants for it directly, to capture nonignorable missing values.

A rough understanding of why certain data is missing can also inform how we manage and treat missing values. If male participants, for example, are more willing to supply information about their salary than female participants, this would rule out using the mean of the existing data (drawn mostly from male respondents) to populate the missing values (belonging mostly to female respondents).
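The salary example can be sketched with a handful of made-up responses. The `responses` DataFrame and its figures are purely illustrative assumptions, chosen only to show how the overall mean drifts toward the group that answered more often.

```python
import pandas as pd

# Hypothetical responses: all four men report a salary, only one woman does.
responses = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "salary": [70_000, 72_000, 68_000, 74_000, 60_000, None, None, None],
})

# mean() skips missing values, so this average is dominated by male respondents.
overall_mean = responses["salary"].mean()

# Per-group means expose the gap that the overall mean hides.
group_means = responses.groupby("gender")["salary"].mean()

# Filling with the overall mean assigns a mostly male average to female respondents.
naive_fill = responses["salary"].fillna(overall_mean)

print(overall_mean)   # 68800.0
print(group_means)    # F: 60000.0, M: 71000.0
```

Here every missing value belongs to a female respondent, yet the naive fill assigns each of them 68,800, well above the one female salary that was actually observed.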

Managing MCAR is relatively straightforward as the data values collected can be considered a random sample and are more easily aggregated or estimated. We’ll discuss common methods for filling missing values in this chapter, but first let’s review the code in Python for inspecting missing values.

df.isnull().sum()
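For a self-contained version of the line above, the following sketch builds a small DataFrame first. The dataset and its column names are hypothetical; only `isnull()` and `sum()` come from the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "salary": [52_000, 61_000, np.nan, np.nan],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# isnull() marks each missing cell as True; sum() then counts the True values per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

The result is a Series indexed by column name, so each column's count of missing values can be read off directly (here, one for `age`, two for `salary`, and one for `city`).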
