How to handle datasets with missing or corrupted data

In the machine learning cycle, data preparation is an important step towards exploring and analyzing data. It is essential to handle missing or corrupt data in order to have a clean dataset on which one can build accurate models or from which one can draw sound conclusions. We will explore how missing or corrupted data can be handled, using the Python pandas module (and sklearn for the k-nearest neighbors method) to demonstrate the methods highlighted below.

Handling missing data

If data is missing, you can handle it with one of the following methods:

Remove data: You can remove the rows with missing data (null or NaN values) from the dataset. Removing data is appropriate when only a few rows have missing values and dropping them does not change the dataset drastically. The disadvantage of this method is that you lose information. In pandas, this is done with the dropna() function.
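As a minimal sketch of row removal, the snippet below applies dropna() to a small DataFrame. The column names and values are hypothetical and only serve to illustrate the call; the subset and axis arguments are shown as optional variations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Lahore", "Paris", None, "Tokyo"],
    "income": [50000, 62000, 58000, 71000],
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna()

# Drop a row only when a specific column ("age") is missing.
age_required = df.dropna(subset=["age"])

# Drop columns (instead of rows) that contain any missing value.
cols_dropped = df.dropna(axis=1)

print(len(df), len(rows_dropped), len(age_required))
print(list(cols_dropped.columns))
```

dropna() returns a new DataFrame and leaves df unchanged, so it is easy to compare row counts before and after to see how much information is lost.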

Impute with mean, median, or mode: The null values can be replaced by a relevant mean, median, or mode value. This means that you calculate the mean, median, or mode of each feature and replace the missing values in that column with the chosen statistic. For the mean and median, the column with missing data must be of numeric type; the mode also works for categorical columns. Imputation preserves data, compared to the first method, where entire rows are deleted. However, the disadvantage is that it can introduce bias and distort the variance of the feature. In pandas, values can be imputed with the replace() function (fillna() is a common alternative).
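A minimal sketch of statistic-based imputation follows. The DataFrame is hypothetical; the numeric column is filled with its mean using replace(), and the categorical column is filled with its mode using fillna(), which is assumed here as an equivalent way to fill missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Paris", "Tokyo", None, "Paris"],
})

# pandas skips missing values when computing these statistics.
age_mean = df["age"].mean()
age_median = df["age"].median()
city_mode = df["city"].mode()[0]  # mode() returns a Series; take the first entry

# Work on a copy so the original data stays available for comparison.
imputed = df.copy()

# Numeric column: replace NaN with the mean (swap in age_median if preferred).
imputed["age"] = imputed["age"].replace(np.nan, age_mean)

# Categorical column: fill missing entries with the most frequent value.
imputed["city"] = imputed["city"].fillna(city_mode)

print(imputed)
```

The median is usually the safer choice when the column contains outliers, since a few extreme values can pull the mean away from typical values.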

Impute using k-nearest neighbors: This method imputes each missing value from the k rows of the dataset that are most similar to the row containing it. It replaces the missing value with the average (optionally weighted by distance) of that feature across the k nearest neighbors. The neighbors are found using the Euclidean distance between data points, computed over the features that are present in both rows. Searching for neighbors for every missing value can take a lot of time on large datasets. In the sklearn module, this is done with the KNNImputer class: first, create the imputer with the number of neighbors to take into account; then, fit it to the data and transform the data to fill in the missing values.
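The following sketch shows KNN imputation with sklearn's KNNImputer on a small, hypothetical numeric array. The choice of n_neighbors=2 and uniform weights is an assumption made for illustration, not a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; np.nan marks the missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Create the imputer with the number of neighbors to take into account.
# weights="uniform" averages the neighbors equally; weights="distance"
# would give closer neighbors more influence.
imputer = KNNImputer(n_neighbors=2, weights="uniform")

# Fit the imputer to the data and fill in the missing values in one step.
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

KNNImputer works on numeric features only, and because it relies on distances, features on very different scales may need to be scaled first so that one feature does not dominate the neighbor search.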

Handling corrupted data