How to check for duplicated rows of a DataFrame in Pandas

Overview

In pandas, the duplicated() function returns a Boolean Series that marks which rows of a DataFrame are duplicates of another row.

Syntax

The syntax for the duplicated() function is as follows:

DataFrame.duplicated(subset=None, keep='first')

Parameters

The duplicated() function takes the following parameters:

subset (optional): This is a column label or sequence of labels naming the columns to compare when identifying duplicates. By default, all columns are used.
keep (optional): This determines which occurrences are marked as duplicates. It takes one of the following values (the default is "first"):

"first": To mark every duplicate as True except for the first occurrence.
"last": To mark every duplicate as True except for the last occurrence.
False: To mark all duplicates as True. Note that this is the Boolean False, not the string "false".

Return value

The duplicated() function returns a Boolean Series with one value per row: True if the row is marked as a duplicate, and False otherwise.

By default (keep="first"), the duplicated() function returns False for the first occurrence of a duplicated row and True for every later occurrence. With keep="last", the last occurrence is marked False and every earlier occurrence is marked True. With keep=False, all occurrences are marked True.
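
The effect of the keep and subset parameters can be compared side by side. The following is a minimal sketch, not part of the original example: the column names (name, score) and the sample values are hypothetical and chosen only so that each option gives a different result.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical,
# and rows 0, 2 and 3 share the same value in "name".
df = pd.DataFrame({"name": ["Ada", "Bob", "Ada", "Ada"],
                   "score": [1, 2, 1, 3]})

first = df.duplicated(keep="first")       # default behaviour
last = df.duplicated(keep="last")         # only the last occurrence stays False
all_dups = df.duplicated(keep=False)      # Boolean False: flag every occurrence
by_name = df.duplicated(subset=["name"])  # compare the "name" column only

# Collect the four results in one table for comparison
comparison = pd.DataFrame({"keep='first'": first,
                           "keep='last'": last,
                           "keep=False": all_dups,
                           "subset=['name']": by_name})
print(comparison)
```

Here, only row 2 is flagged by default, only row 0 is flagged with keep="last", both rows are flagged with keep=False, and rows 2 and 3 are flagged when only the name column is compared.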

Example

# Code to illustrate the duplicated() function

# importing the pandas library
import pandas as pd

# creating a dataframe
df = pd.DataFrame([["THEO",1,1,3,"A"],
                   ["Theo",1,1,3,"A"],
                   ["THEO",1,1,3,"A"]],
                   columns=list('ABCDE'))
# printing the dataframe
print(df)

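print("\n")
# check for duplicate rows: row 2 repeats row 0, so only row 2 is True
# (row 1 differs because "Theo" is not equal to "THEO")
print(df.duplicated())

print("\n")
# keep="last": the earlier occurrence (row 0) is True, the last (row 2) is False
print(df.duplicated(keep = "last"))

print("\n")
# check for duplicates in column A only
print(df.duplicated(subset = ["A"]))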
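
The Boolean Series returned by duplicated() is usually used as a mask rather than just printed. The following sketch reuses the data from the example above to count the duplicates, select them, and filter them out. The variable names (mask, num_duplicates, duplicate_rows, deduplicated) are illustrative choices, not part of the pandas API.

```python
import pandas as pd

# Same data as in the example above
df = pd.DataFrame([["THEO", 1, 1, 3, "A"],
                   ["Theo", 1, 1, 3, "A"],
                   ["THEO", 1, 1, 3, "A"]],
                  columns=list("ABCDE"))

mask = df.duplicated()            # True for each row that repeats an earlier row

num_duplicates = int(mask.sum())  # True counts as 1, so the sum is the number of duplicates
duplicate_rows = df[mask]         # only the rows flagged as duplicates
deduplicated = df[~mask]          # every row that is not flagged

print(num_duplicates)
print(duplicate_rows)
print(deduplicated)
```

Filtering with ~mask keeps the same rows as df.drop_duplicates() with its default settings, so the mask is mainly useful when the duplicates need to be inspected or counted before they are removed.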