How to remove duplicate rows in a DataFrame

Overview

Python is a multipurpose programming language and a great tool for data analysis. While analyzing data with pandas, we sometimes encounter redundant or duplicate rows that need to be removed. To do this, we use the drop_duplicates() method, which removes repeated rows from a DataFrame.

Syntax
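
A minimal sketch of the call, with every parameter written out at its documented default. The DataFrame df here is made-up toy data, not part of the example dataset.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# All four parameters shown at their default values.
result = df.drop_duplicates(
    subset=None,         # compare rows on all columns
    keep="first",        # keep the first occurrence of each duplicate
    inplace=False,       # return a new DataFrame instead of modifying df
    ignore_index=False,  # keep the original index labels
)
print(result)
```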

Parameters

subset: A column label or list of column labels to consider when identifying duplicates. By default, it is None, which means that all columns are compared. When column(s) are passed to it, rows are treated as duplicates if they match in those columns alone.
keep: This is the parameter that determines which of the duplicate rows to keep.

By default, it is set to "first", which means that the first occurrence is kept and the later ones are dropped as duplicates.
We can also set it to "last", in which case the last occurrence is kept and the earlier ones are dropped.
If we set it to False, every row that has a duplicate is dropped, so no copy of a repeated row remains.

inplace: This is a boolean value, and its default value is False. When set to True, the DataFrame is modified in place and the method returns None instead of a new DataFrame.
ignore_index: This is a boolean value, and its default value is False. When set to True, the original index labels are discarded and the result is relabeled 0, 1, ..., n - 1.
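
The effect of subset, keep, and ignore_index can be sketched on a small made-up DataFrame. The column names id and score are assumptions chosen for illustration; they do not come from the example dataset.

```python
import pandas as pd

# Hypothetical toy data: "id" repeats, but "score" differs between the repeats.
df = pd.DataFrame({
    "id": [1, 2, 1, 3, 2],
    "score": [10, 20, 30, 40, 50],
})

# No row is a full duplicate, so the default call removes nothing.
unchanged = df.drop_duplicates()

# subset: compare rows on "id" only; keep: choose which occurrence survives.
first = df.drop_duplicates(subset=["id"], keep="first")
last = df.drop_duplicates(subset=["id"], keep="last")
none = df.drop_duplicates(subset=["id"], keep=False)

# ignore_index: relabel the result 0, 1, 2 instead of keeping the old labels.
relabeled = df.drop_duplicates(subset=["id"], ignore_index=True)

print(first, last, none, relabeled, sep="\n\n")
```

Passing subset matters whenever rows should count as duplicates based on a key column rather than on every column.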

Return value

The method returns a new DataFrame with the duplicate rows removed. If inplace is set to True, it returns None, and the original DataFrame is modified instead.
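
A short sketch of the two possible return values, using a made-up single-column DataFrame:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"a": [1, 1, 2]})

# Default: a new DataFrame comes back and df itself is left untouched.
new_df = df.drop_duplicates()
rows_before = len(df)

# inplace=True: df is modified directly and the call returns None.
ret = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(type(new_df).__name__, rows_before)
print(ret, rows_after)
```

Because the in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None, which is a common mistake.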

Explanation

Let’s understand this with an example. In the code snippet below, we are going to filter the existing DataFrame df and exclude the redundant observations.

employee.csv

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
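
A minimal sketch of what main.py could look like for this data. To stay self-contained, it reads the rows above from an in-memory string; with the file on disk, pd.read_csv("employee.csv") would be the usual call. The variable names CSV_TEXT and deduped are assumptions made for this sketch.

```python
import io

import pandas as pd

# Contents of employee.csv, inlined so the sketch runs without the file.
CSV_TEXT = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""

# read_csv parses the literal "NA" entries as missing values (NaN) by default.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Drop rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()

print(df.shape)
print(deduped.shape)
print(deduped["Id"].tolist())
```

The rows with Id 1463, 1464, and 1465 each appear twice, so three of the nine rows are dropped.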

Remove duplicates but keep the first rows
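
A sketch of this case with keep="first". For brevity, the DataFrame below holds only the Id and LotArea columns of the nine rows in employee.csv; trimming the columns is a simplification made here, not part of the original example.

```python
import pandas as pd

# Id and LotArea values taken from the nine rows of employee.csv.
df = pd.DataFrame({
    "Id": [1461, 1463, 1464, 1465, 1467, 1463, 1464, 1465, 1466],
    "LotArea": [11622, 13830, 9978, 5005, 7980, 13830, 9978, 5005, 10000],
})

# keep="first" (the default) retains the earliest copy of each repeated row.
first_kept = df.drop_duplicates(keep="first")

print(first_kept)
```

The surviving rows keep their original index labels, so the second copies at positions 5, 6, and 7 are the ones that disappear.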

Remove duplicates but keep the last rows
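
A sketch of this case with keep="last". As a simplification, the DataFrame holds only the Id and LotArea columns of the nine rows in employee.csv.

```python
import pandas as pd

# Id and LotArea values taken from the nine rows of employee.csv.
df = pd.DataFrame({
    "Id": [1461, 1463, 1464, 1465, 1467, 1463, 1464, 1465, 1466],
    "LotArea": [11622, 13830, 9978, 5005, 7980, 13830, 9978, 5005, 10000],
})

# keep="last" retains the latest copy of each repeated row instead.
last_kept = df.drop_duplicates(keep="last")

print(last_kept)
```

The same six distinct rows remain, but the earlier copies at positions 1, 2, and 3 are dropped, which changes both the row order and the index labels of the result.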

Remove all duplicates
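
A sketch of this case with keep=False. As a simplification, the DataFrame holds only the Id and LotArea columns of the nine rows in employee.csv.

```python
import pandas as pd

# Id and LotArea values taken from the nine rows of employee.csv.
df = pd.DataFrame({
    "Id": [1461, 1463, 1464, 1465, 1467, 1463, 1464, 1465, 1466],
    "LotArea": [11622, 13830, 9978, 5005, 7980, 13830, 9978, 5005, 10000],
})

# keep=False drops every row that has a duplicate, keeping no copy at all.
unique_only = df.drop_duplicates(keep=False)

print(unique_only)
```

Only the rows that occur exactly once remain, so this option suits cases where any repeated record is considered suspect.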