Python is a multipurpose programming language and a great tool for data analysis. While analyzing data with pandas, we sometimes encounter redundant or duplicate rows that need to be removed. To do this, we use the drop_duplicates() method, which removes repeated rows from a DataFrame.
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
subset: The column label or list of column labels to consider when identifying duplicates. By default, it is None, which means all columns are compared. When column(s) are passed, rows count as duplicates if they match in those columns alone.
keep: Determines which occurrence to keep. It accepts three values:
first (the default), which means the first occurrence is kept and the rest are dropped as duplicates.
last, in which case the last occurrence is kept and the rest are dropped as duplicates.
False, in which case every row that has a duplicate is dropped, including the first occurrence.
inplace: A boolean value that defaults to False. When set to True, the DataFrame is modified directly and the method returns None instead of a new DataFrame.
ignore_index: A boolean value that defaults to False. When set to True, the original index labels are discarded and the result is relabeled 0, 1, ..., n - 1.
The method returns a new DataFrame with the duplicate rows removed, or None if inplace is True.
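The effect of subset, ignore_index, and inplace can be illustrated with a minimal sketch. The small DataFrame here is hypothetical toy data chosen for this illustration, and the variable names are assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the second row exactly.
df = pd.DataFrame({
    "Id": [1461, 1463, 1464, 1463],
    "MSZoning": ["RH", "RL", "RL", "RL"],
    "LotArea": [11622, 13830, 9978, 13830],
})

# subset: only MSZoning is compared, so one row per zoning value survives.
by_zone = df.drop_duplicates(subset=["MSZoning"])

# ignore_index: the surviving rows are relabeled 0, 1, 2.
renumbered = df.drop_duplicates(ignore_index=True)

# inplace: the copy is modified directly and the call returns None.
copy = df.copy()
result = copy.drop_duplicates(inplace=True)

print(by_zone)
print(renumbered)
print(result)
```

Because inplace=True returns None, assigning its result back to the DataFrame variable is a common mistake; the sketch keeps the return value in a separate name to make that visible.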
Let’s understand this with an example. We are going to filter the existing DataFrame df and exclude the redundant observations. The sample data, employee.csv, contains the following rows:
Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
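A minimal sketch of this example might look as follows. The article reads employee.csv from disk; here the same rows are embedded in a string and read through io.StringIO, which is an assumption made only so the snippet runs without the file. The names csv_text and deduped are illustrative.

```python
import io
import pandas as pd

# Same rows as the sample data, embedded so the sketch is self-contained.
csv_text = """\
Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""

# Read the data as a DataFrame (with a real file: pd.read_csv("employee.csv")).
df = pd.read_csv(io.StringIO(csv_text))

# Drop rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()

print(len(df), "rows before,", len(deduped), "rows after")
print(deduped)
```

Note that the surviving rows keep their original index labels, so the result has gaps in its index unless ignore_index=True is passed.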
We use the pd.read_csv() function to read the employee.csv data as a DataFrame.
We use df.drop_duplicates() to remove duplicate rows from the DataFrame. By default, it keeps the first occurrence of each row and removes the later copies.
Now, let’s try different parameters.
In this example, we set keep to 'first'. Rows are compared across all columns, and for each group of identical rows, only the first occurrence is kept.
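As a rough sketch, assuming a trimmed version of the sample data (only four of its columns, embedded as a string so the snippet is self-contained), this could be written as:

```python
import io
import pandas as pd

# Trimmed copy of the sample rows (an assumption made to keep the sketch short).
csv_text = """\
Id,MSSubClass,MSZoning,LotArea
1461,20,RH,11622
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1467,20,RL,7980
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1466,60,RL,10000
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep='first': the earliest copy of each repeated row survives.
first_kept = df.drop_duplicates(keep="first")

# The surviving index labels show which copies were retained.
print(first_kept)
```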
In this example, we set keep to 'last'. Rows are compared across all columns, and for each group of identical rows, only the last occurrence is kept.
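A comparable sketch for keep='last', again assuming a trimmed, embedded copy of the sample data rather than the original file:

```python
import io
import pandas as pd

# Trimmed copy of the sample rows (an assumption made to keep the sketch short).
csv_text = """\
Id,MSSubClass,MSZoning,LotArea
1461,20,RH,11622
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1467,20,RL,7980
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1466,60,RL,10000
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep='last': the latest copy of each repeated row survives.
last_kept = df.drop_duplicates(keep="last")

# Rows 1, 2 and 3 are dropped in favor of their later copies at 5, 6 and 7.
print(last_kept)
```

The set of distinct rows is the same as with keep='first'; only the positions (and therefore the index labels and row order) of the retained copies differ.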
In this example, we set keep to False. Every row that appears more than once is removed, so no copy of a repeated row is kept.
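A sketch of keep=False under the same assumption (a trimmed, embedded copy of the sample data) might look like this:

```python
import io
import pandas as pd

# Trimmed copy of the sample rows (an assumption made to keep the sketch short).
csv_text = """\
Id,MSSubClass,MSZoning,LotArea
1461,20,RH,11622
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1467,20,RL,7980
1463,60,RL,13830
1464,60,RL,9978
1465,120,RL,5005
1466,60,RL,10000
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep=False: every row that has a duplicate is dropped, including the first copy.
unique_only = df.drop_duplicates(keep=False)

# Only rows that occur exactly once in the data remain.
print(unique_only)
```

Note that the value is the boolean False, not the string 'false'.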