Trusted answers to developer questions

Related Tags

communitycreator
dataframe

How to remove duplicate rows in DataFrame

AKASH BAJWA

Overview

Python is a multipurpose programming language and a great tool for data analysis. While analyzing data in Python with pandas, we sometimes encounter redundant or duplicate rows that need to be removed. To do this, we use the drop_duplicates() method of a DataFrame, which removes rows whose values repeat an earlier (or later) row.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Parameters

  • subset: The column label or list of column labels to compare when identifying duplicates. By default, it is None, which means all columns are compared. When column(s) are passed, two rows count as duplicates if they match in those columns alone.
  • keep: This parameter determines which of the duplicate rows to keep.
    • By default, it is set to 'first', which means the first occurrence is kept and the later occurrences are removed as duplicates.
    • We can also set it to 'last', in which case the last occurrence is kept and the earlier occurrences are removed.
    • If we set it to False, every row that has a duplicate is removed, so no occurrence is kept.
  • inplace: This is a boolean value, and its default value is False. When set to True, the duplicates are removed from the DataFrame itself and the method returns None instead of a new DataFrame.
  • ignore_index: This is a boolean value, and its default value is False. When set to True, the original index labels are discarded and the remaining rows are relabeled 0, 1, ..., n - 1.
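To make the subset and ignore_index parameters concrete, here is a minimal sketch. The toy DataFrame and its column names are hypothetical and are not taken from employee.csv.

```python
import pandas as pd

# Hypothetical toy data (assumption: not part of the original example).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "city": ["Oslo", "Rome", "Lima", "Rome"],
})

# No two rows match in every column, so the default call drops nothing.
all_columns = df.drop_duplicates()

# Compare only the "name" column: the second "Ann" row (index 2) is dropped.
by_name = df.drop_duplicates(subset=["name"])

# ignore_index=True relabels the surviving rows 0, 1, 2.
by_name_reset = df.drop_duplicates(subset=["name"], ignore_index=True)

print(by_name.index.tolist())        # [0, 1, 3]
print(by_name_reset.index.tolist())  # [0, 1, 2]
```

Without ignore_index, the surviving rows keep their original labels, which leaves a gap where the dropped row used to be.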

Return value

The method returns a new DataFrame with the duplicate rows removed, or None if inplace is set to True.
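The difference between the two return values can be checked with a short sketch. The single-column DataFrame here is a hypothetical stand-in chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical data with one repeated value.
df = pd.DataFrame({"x": [1, 1, 2]})

# Default call: a new DataFrame comes back and df itself is untouched.
result = df.drop_duplicates()
rows_after_default_call = len(df)   # still 3

# inplace=True: df is modified directly and the call returns None.
in_place_result = df.drop_duplicates(inplace=True)

print(type(result).__name__)  # DataFrame
print(in_place_result)        # None
print(len(df))                # 2
```

Because the in-place call returns None, assigning its result back to df (df = df.drop_duplicates(inplace=True)) would discard the DataFrame.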

Explanation

Let’s understand this with an example. In the code snippet below, we are going to filter the existing DataFrame df and exclude the redundant observations.

employee.csv
Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
Example of the drop_duplicates() method
  • We use the pd.read_csv() function to read the employee.csv data as a DataFrame.
  • We invoke df.drop_duplicates() to remove duplicate rows from the DataFrame. By default, it keeps the first occurrence of each row and removes the later repeats.
  • We print the updated DataFrame as output.
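The steps above can be sketched as follows. To keep the sketch self-contained, it reads the employee.csv contents from an in-memory string through io.StringIO instead of a file on disk; that substitution and the variable names are assumptions of this sketch.

```python
import io

import pandas as pd

# Contents of employee.csv, held in memory so the sketch runs without
# a file on disk (assumption: a real script would pass the file path).
CSV_TEXT = """Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig
1461,20,RH,80,11622,Pave,NA,Reg,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1467,20,RL,NA,7980,Pave,NA,IR1,Lvl,AllPub,Inside
1463,60,RL,74,13830,Pave,NA,IR1,Lvl,AllPub,Inside
1464,60,RL,78,9978,Pave,NA,IR1,Lvl,AllPub,Inside
1465,120,RL,43,5005,Pave,NA,IR1,HLS,AllPub,Inside
1466,60,RL,75,10000,Pave,NA,IR1,Lvl,AllPub,Corner
"""

# Read the CSV data into a DataFrame.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Drop every row that repeats an earlier row in all columns
# (keep='first' is the default).
deduped = df.drop_duplicates()

# Print the updated DataFrame.
print(deduped)
```

The rows with Id 1463, 1464, and 1465 each appear twice in the data, so the result should hold six of the nine original rows.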

Now, let’s try different parameters.

Remove duplicates but keep the first rows

In this example, we remove every row whose values repeat in all the columns, keeping only the first occurrence of each (keep='first').

Remove all repeating rows except the first one
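A minimal sketch of this case is shown below. It uses only three columns copied from employee.csv to keep the listing short; the trimmed data and the variable names are assumptions of this sketch.

```python
import io

import pandas as pd

# Three columns copied from employee.csv (trimmed for brevity).
CSV_TEXT = """Id,MSSubClass,LotArea
1461,20,11622
1463,60,13830
1464,60,9978
1465,120,5005
1467,20,7980
1463,60,13830
1464,60,9978
1465,120,5005
1466,60,10000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# keep='first' retains the first occurrence of each repeated row.
first_kept = df.drop_duplicates(keep="first")

print(first_kept)
```

Rows 5, 6, and 7 repeat rows 1, 2, and 3, so the surviving index labels are 0, 1, 2, 3, 4, and 8.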

Remove duplicates but keep the last rows

In this example, we remove every row whose values repeat in all the columns, keeping only the last occurrence of each (keep='last').

Remove all repeating rows except the last one
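A minimal sketch of this case is shown below. It uses only three columns copied from employee.csv to keep the listing short; the trimmed data and the variable names are assumptions of this sketch.

```python
import io

import pandas as pd

# Three columns copied from employee.csv (trimmed for brevity).
CSV_TEXT = """Id,MSSubClass,LotArea
1461,20,11622
1463,60,13830
1464,60,9978
1465,120,5005
1467,20,7980
1463,60,13830
1464,60,9978
1465,120,5005
1466,60,10000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# keep='last' retains the last occurrence of each repeated row.
last_kept = df.drop_duplicates(keep="last")

print(last_kept)
```

Here the earlier copies at index 1, 2, and 3 are dropped, so the surviving index labels are 0, 4, 5, 6, 7, and 8.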

Remove all duplicates

In this example, we remove every row that has a duplicate, so no occurrence of a repeated row is kept (keep=False).

Remove all repeating rows
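A minimal sketch of this case is shown below. It uses only three columns copied from employee.csv to keep the listing short; the trimmed data and the variable names are assumptions of this sketch.

```python
import io

import pandas as pd

# Three columns copied from employee.csv (trimmed for brevity).
CSV_TEXT = """Id,MSSubClass,LotArea
1461,20,11622
1463,60,13830
1464,60,9978
1465,120,5005
1467,20,7980
1463,60,13830
1464,60,9978
1465,120,5005
1466,60,10000
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# keep=False drops every row that has a duplicate, including the first copy.
unique_only = df.drop_duplicates(keep=False)

print(unique_only)
```

Only the rows that never repeat remain: Id 1461, 1467, and 1466, at index labels 0, 4, and 8.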
