How to find and remove duplicate data in pandas
A common task in data science and analysis is identifying and removing duplicate data. In most cases, duplicate data is of little use and can skew your analysis, for example by inflating counts or biasing averages. This is why it is important to know how to identify and remove it.
We will use pandas to identify and remove duplicate data from a data frame. Take a look at the code snippet below:
import pandas as pd

user_cols = ['user_id', 'age', 'gender',
             'occupation', 'zip_code']

users = pd.read_table('http://bit.ly/movieusers',
                      sep='|', header=None,
                      names=user_cols, index_col='user_id')

print("\nDuplicate Zip Codes:")
print(users.zip_code.duplicated().tail())

print("\nNumber of Duplicate Zip Codes:")
print(users.zip_code.duplicated().sum())

print("\nDuplicate Rows:")
print(users.duplicated().tail())

print("\nNumber of Duplicate Rows:")
print(users.duplicated().sum())

print("\nTotal number of Rows:", users.shape[0])
users = users.drop_duplicates()
print("\nTotal number of Unique Rows :", users.shape[0])
Explanation:
- In line 1, we import the required package.
- In line 3, we create a list of the column names present in our data.
- In line 6, we read the data into a data frame, passing the column names and the index column. At this point, our data is loaded as a data frame in users.
- In line 11, we call duplicated() on the zip_code column, which returns True for a value that has already appeared earlier in the column and False otherwise, and print the last five entries of the result. Here, we can see that, of those last five entries, there is one zip_code that is a duplicate.
- In line 14, we print the number of duplicate values in the zip_code column by using the sum() function. In the sum, True counts as 1 and False counts as 0.
- In line 17, we check whether entire rows are duplicated in the data frame. Note that, here, we call duplicated() on the data frame itself rather than on a single column. We then print the last five entries of the result. In the output, we can see that the last five rows are not duplicates.
- In line 20, we print the number of duplicate rows using the sum() function.
Now that we have identified the duplicate data in our data frame, it is time to remove the duplicates.
- In line 23, we call drop_duplicates() on the entire data frame. This returns a new data frame containing only the unique rows (the first occurrence of each duplicated row is kept), which we assign back to users. We can verify this by comparing the number of rows before and after removing the duplicates.
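To see the same steps without downloading anything, here is a minimal sketch on a small, made-up data frame. The name toy and its values are assumptions for illustration only and are not taken from the movie users file. The calls are the same ones used above: duplicated() on a column, duplicated() on the whole frame, sum(), and drop_duplicates().

```python
import pandas as pd

# Hypothetical toy data (not the movie users file): rows 0 and 3 are identical.
toy = pd.DataFrame({
    "age": [24, 53, 23, 24, 33],
    "occupation": ["technician", "other", "writer", "technician", "other"],
    "zip_code": ["85711", "94043", "32067", "85711", "94043"],
})

# Column-level check: True marks a zip code already seen in an earlier row.
zip_dupes = toy.zip_code.duplicated()
print(zip_dupes.tolist())   # [False, False, False, True, True]
print(zip_dupes.sum())      # 2

# Row-level check: True only when every column matches an earlier row.
row_dupes = toy.duplicated()
print(row_dupes.tolist())   # [False, False, False, True, False]
print(row_dupes.sum())      # 1

# drop_duplicates() returns a new frame; the first occurrence is kept.
unique_rows = toy.drop_duplicates()
print(toy.shape[0], unique_rows.shape[0])   # 5 4
```

Note that row 4 shares a zip code with row 1 but differs in age, so it counts as a duplicate zip code and not as a duplicate row. This is why the column-level and row-level counts can differ.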
In this way, we can easily identify and remove the duplicate data from our data frame in pandas.
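The walkthrough counts duplicate zip codes but only drops rows that are duplicated in full. As a possible variation, the sketch below uses the subset and keep parameters that both duplicated() and drop_duplicates() accept. The toy data frame is again hypothetical, and whether dropping by a single column makes sense depends on your analysis.

```python
import pandas as pd

# Hypothetical toy data; rows 0 and 3 are fully identical.
toy = pd.DataFrame({
    "age": [24, 53, 23, 24, 33],
    "zip_code": ["85711", "94043", "32067", "85711", "94043"],
})

# keep=False flags every member of a duplicated group, not just the repeats.
all_copies = toy.duplicated(keep=False)
print(all_copies.tolist())            # [True, False, False, True, False]

# subset restricts the comparison to the listed columns.
one_per_zip = toy.drop_duplicates(subset=["zip_code"])
print(one_per_zip.index.tolist())     # [0, 1, 2]

# keep="last" retains the final occurrence instead of the first.
last_per_zip = toy.drop_duplicates(subset=["zip_code"], keep="last")
print(last_per_zip.index.tolist())    # [2, 3, 4]
```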