How to remove duplicate rows in a DataFrame
Overview
Python is a multipurpose programming language and a great tool for data analysis. While analyzing data in Python, we sometimes encounter redundant or duplicate rows that need to be removed. To do this, we use the drop_duplicates() method of the pandas DataFrame, which removes repeated rows from the data.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
Parameters
subset: This is a column label or a list of column labels to consider when identifying duplicates. By default, it is None, which means all columns are compared. When column(s) are passed to it, rows are treated as duplicates if they match in those columns alone.
keep: This is the parameter that determines which of the duplicate rows to keep.
- By default, it is set to 'first', which means that the first occurrence is considered original, and the rest are considered duplicates.
- We can also set it to 'last', in which case the last occurrence will be considered original, and the rest will be considered duplicates.
- If we set it to False, it will consider all repeated rows to be duplicates and remove them all.
inplace: This is a boolean value, and its default value is False. When set to True, the duplicate rows are removed from the original DataFrame itself, and the method returns None instead of a new DataFrame.
ignore_index: This is a boolean value, and its default value is False. When set to True, the original index labels are discarded, and the remaining rows are relabeled 0, 1, ..., n - 1.
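To make the subset and ignore_index parameters concrete, here is a minimal sketch. The toy DataFrame and its column names (name, city) are hypothetical and were invented for illustration; they are not part of the article's data set.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "city": ["Oslo", "Rome", "Lima", "Rome"],
})

# subset: compare only the "name" column, so the second "Ann" row
# counts as a duplicate even though its city differs.
by_name = df.drop_duplicates(subset=["name"])
print(by_name.index.tolist())  # [0, 1, 3]

# ignore_index=True: the same rows survive, but they are relabeled 0..n-1.
relabeled = df.drop_duplicates(subset=["name"], ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2]
```

Without ignore_index, the surviving rows keep their original labels, which is why the first result skips label 2.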
Return value
The method returns a new DataFrame with the duplicate rows removed. If inplace is set to True, it modifies the original DataFrame instead and returns None.
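The difference between the two calling styles can be checked with a short sketch. The one-column DataFrame below is hypothetical toy data, chosen only to keep the output small.

```python
import pandas as pd

# Hypothetical toy data: the value 1 appears twice.
df = pd.DataFrame({"x": [1, 1, 2]})

# Default call: a new DataFrame comes back and df is left unchanged.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# inplace=True: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result is None, len(df))  # True 2
```

Because the in-place call returns None, assigning its result back to df would discard the data, so the two styles should not be mixed.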
Explanation
Let’s understand this with an example. In the code snippet below, we are going to filter the existing DataFrame df and exclude the redundant observations.
Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
- We use the pd.read_csv() function to read the employee.csv data as a DataFrame.
- We invoke df.drop_duplicates() to remove duplicate rows from the DataFrame. By default, it keeps the first occurrence of each row and removes the repeats.
- We print the updated DataFrame as the output.
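The steps above can be sketched as follows. Since the employee.csv file itself is not available here, this sketch assumes the rows shown above and embeds them as text read through io.StringIO; the variable names csv_text and unique_df are illustrative.

```python
import io

import pandas as pd

# The article's rows, embedded as text so the sketch runs without a file.
csv_text = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""

# Read the CSV text into a DataFrame ("NA" is parsed as a missing value).
df = pd.read_csv(io.StringIO(csv_text))

# Drop fully repeated rows; the default keep='first' retains the first copy.
unique_df = df.drop_duplicates()

print(len(df), len(unique_df))  # 9 6
print(unique_df)
```

Three of the nine rows (Ids 1463, 1464, and 1465) appear twice, so six rows remain.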
Now, let’s try different parameters.
Remove duplicates but keep the first rows
In this example, we set keep='first' so that, for each group of rows that match in every column, the first occurrence is kept and the later ones are removed.
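A minimal sketch of keep='first', assuming the same rows as the article's data set, embedded as text so that no file is needed. The variable names are illustrative.

```python
import io

import pandas as pd

# The article's rows, embedded as text (an assumption of this sketch).
csv_text = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep='first': the earliest copy of each repeated row survives.
first_kept = df.drop_duplicates(keep="first")

# The surviving rows keep their original index labels.
print(first_kept.index.tolist())  # [0, 1, 2, 3, 4, 8]

# 'first' is the default, so the explicit call matches the plain call.
print(first_kept.equals(df.drop_duplicates()))  # True
```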
Remove duplicates but keep the last rows
In this example, we set keep='last' so that, for each group of rows that match in every column, the last occurrence is kept and the earlier ones are removed.
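A minimal sketch of keep='last', again assuming the article's rows embedded as text. The variable names are illustrative.

```python
import io

import pandas as pd

# The article's rows, embedded as text (an assumption of this sketch).
csv_text = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep='last': the latest copy of each repeated row survives.
last_kept = df.drop_duplicates(keep="last")

# The earlier copies (labels 1, 2, 3) are dropped; the later ones remain.
print(last_kept.index.tolist())  # [0, 4, 5, 6, 7, 8]
print(last_kept["Id"].tolist())  # [1461, 1467, 1463, 1464, 1465, 1466]
```

The same six distinct rows survive as with keep='first'; only which copy is retained, and therefore the row order and index labels, differs.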
Remove all duplicates
In this example, we set keep=False so that every row that has a duplicate is removed, including its first occurrence. Only the rows that appear exactly once remain.
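A minimal sketch of keep=False, assuming the article's rows embedded as text. Note that the value is the Python boolean False, not the string 'false'. The variable names are illustrative.

```python
import io

import pandas as pd

# The article's rows, embedded as text (an assumption of this sketch).
csv_text = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep=False: every copy of a repeated row is dropped, originals included.
none_kept = df.drop_duplicates(keep=False)

# Only the rows that occur exactly once are left.
print(none_kept["Id"].tolist())  # [1461, 1467, 1466]
```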