How to check for duplicated rows of a DataFrame in Pandas
Overview
In Pandas, the duplicated() function returns a Boolean Series that indicates which rows of a DataFrame are duplicates of another row.
Syntax
The syntax for the duplicated() function is as follows:
DataFrame.duplicated(subset=None, keep='first')
Parameters
The duplicated() function takes the following parameter values:
- subset (optional): A column label or sequence of labels denoting the columns in which duplicates are to be identified. By default, all columns are used.
- keep (optional): Determines which occurrences are marked as duplicates. It takes any of the following values:
  - "first": Marks every duplicate as True except for the first occurrence. This is the default.
  - "last": Marks every duplicate as True except for the last occurrence.
  - False: Marks all duplicates as True. This is the Boolean value False, not the string "false".
Return value
The duplicated() function returns a Boolean Series with one value per row: True if the row is marked as a duplicate and False otherwise.
By default, the duplicated() function returns False for the first occurrence of a duplicated row and True for every later occurrence. With keep="last", the last occurrence is set as False and every earlier occurrence is set as True.
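Because the return value is a Boolean Series aligned with the rows, it can be used as a mask. The sketch below, again on made-up data (the id and value columns are assumptions for illustration), counts the flagged rows and selects them:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {
        "id": [1, 2, 2, 3, 3, 3],
        "value": ["a", "b", "b", "c", "c", "c"],
    }
)

mask = df.duplicated()        # default keep="first"

n_duplicates = int(mask.sum())  # True counts as 1, so this is the number of flagged rows
repeats = df[mask]              # only the rows flagged as duplicates
unique_rows = df[~mask]         # first occurrences only

print(mask.tolist())   # [False, False, True, False, True, True]
print(n_duplicates)    # 3
```

With the default keep value, df[~mask] selects the same rows as df.drop_duplicates().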
Example
# A code to illustrate the duplicated() function

# importing the pandas library
import pandas as pd

# creating a dataframe
df = pd.DataFrame([["THEO",1,1,3,"A"],
                   ["Theo",1,1,3,"A"],
                   ["THEO",1,1,3,"A"]],
                  columns=list('ABCDE'))
# printing the dataframe
print(df)
print("\n")

# to check for duplicate rows
print(df.duplicated())
print("\n")

# marking every occurrence except the last as True
print(df.duplicated(keep = "last"))
print("\n")

# getting duplicates on column A
print(df.duplicated(subset = ["A"]))
Explanation
- Line 4: We import the pandas library.
- Lines 7-10: We create a DataFrame, df.
- Line 12: We print the DataFrame.
- Line 16: We check for duplicated rows of the DataFrame using the duplicated() function with its default parameter values.
- Line 20: We pass "last" as the parameter value of keep, so every occurrence of a duplicated row except the last one is marked True.
- Line 24: We check for duplicated values in column "A" only.