Inconsistent Data
This lesson will focus on some of the common inconsistencies present in datasets and how to deal with them using pandas.
Inconsistency in data arises from errors made while collecting it. For instance, if the data was gathered from multiple sources, or recorded by multiple people who did not follow the same format, then there is a high chance of inconsistencies in the data.
In this lesson, we will be cleaning the Credit Cards Default Dataset. This dataset is a very good example of the kind of inconsistencies that are present in most datasets.
Credit cards default dataset
The documented details of the individual columns are mentioned below. However, we will see that our dataset is not consistent with this documented format.
Let’s load the dataset.
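As a minimal sketch of the loading step, the snippet below reads a CSV with `pd.read_csv`. The real file path is not shown in this lesson, so a tiny in-memory sample stands in for it. The `SAMPLE_CSV` contents, the column subset, and the `load_dataset` helper are assumptions made purely for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a few made-up rows with an
# unnamed leading serial-number column followed by an ID column.
SAMPLE_CSV = """,ID,LIMIT_BAL,SEX,AGE
0,1,20000,2,24
1,2,120000,2,26
2,3,90000,2,34
"""


def load_dataset(source):
    """Read a CSV from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# With the real dataset, pass its file path instead of the StringIO object.
df = load_dataset(io.StringIO(SAMPLE_CSV))

# head() shows the first rows, which is usually enough to spot stray columns.
print(df.head())
```

Because the header of the first column is empty, pandas gives it a placeholder name (`Unnamed: 0`) and still adds its own row index on the left.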
Just by looking at the output, we can see that pandas keeps serial numbers for us automatically. Since we also have IDs in the ID column, we do not need the first column and can remove it. We can use the drop function to drop columns by ...
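One possible way to remove the redundant first column is sketched below, using `DataFrame.drop` with the `columns` keyword. The sample data is hypothetical and only mimics the shape described above. The lesson's own call may use different arguments.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a file whose first column is a serial number.
SAMPLE_CSV = """,ID,LIMIT_BAL,SEX,AGE
0,1,20000,2,24
1,2,120000,2,26
2,3,90000,2,34
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Look the name up by position instead of hard-coding it, since pandas
# assigns a placeholder name to a column with an empty header.
first_col = df.columns[0]

# drop() returns a new DataFrame by default; df itself is left unchanged.
cleaned = df.drop(columns=[first_col])

print(cleaned.columns.tolist())
```

Assigning the result to a new variable keeps the original frame available for comparison. Passing `inplace=True` would modify `df` directly instead.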