Trusted answers to developer questions

Related Tags

pandas
deep learning
communitycreator

What is drop_duplicates() in pandas?

Eman Kashif


The Python library pandas provides operations to manipulate and inspect data. It uses tabular structures called data frames, which make organizing and analyzing data convenient.

Pandas is widely used to pre-process data before building machine learning models. An important part of pre-processing is removing duplicate records. The drop_duplicates() method of a data frame removes duplicate rows, so that the data fed into the model does not contain redundant records.

Syntax

dataframe.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Arguments

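  1. subset: Specifies the column label(s) to consider when identifying duplicates. Accepts a single column label or a sequence of labels. By default, all columns are compared. (Optional)
  2. keep: Specifies which duplicates the data frame should keep. The options are 'first' (keep the first occurrence), 'last' (keep the last occurrence), and False (drop every row that has a duplicate). 'first' is the default. (Optional)
  3. inplace: If set to True, the operation modifies the existing data frame instead of creating a new one. False is the default. (Optional)
  4. ignore_index: If set to True, the remaining rows are relabelled 0, 1, ..., n - 1. False is the default, which keeps the original index labels. (Optional)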
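The arguments above can be sketched with a small example. The data here is hypothetical and chosen so that the two "Lisa" rows differ in age, which makes the effect of subset and keep visible.

```python
import pandas as pd

# Hypothetical sample data: "Lisa" appears twice with different ages
df = pd.DataFrame({
    "Name": ["Brad", "Lisa", "Olli", "Lisa", "Kris"],
    "Age": [23, 31, 24, 35, 28],
})

# Compare rows on "Name" only; the first "Lisa" (Age 31) is kept by default
first_kept = df.drop_duplicates(subset="Name")

# Keep the last "Lisa" (Age 35) instead
last_kept = df.drop_duplicates(subset=["Name"], keep="last")

# keep=False drops both "Lisa" rows
none_kept = df.drop_duplicates(subset="Name", keep=False)

# ignore_index=True relabels the remaining rows 0, 1, 2, 3
renumbered = df.drop_duplicates(subset="Name", ignore_index=True)

print(first_kept)
print(last_kept)
print(none_kept)
print(renumbered)
```

Without ignore_index=True, the result keeps the original index labels, so gaps appear where rows were removed.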

Return value

A data frame without the duplicate rows is returned. If inplace is set to True, the method returns None and the existing data frame is modified instead.
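As a minimal sketch of the two return behaviours, the snippet below uses a small hypothetical data frame and compares the default call with inplace=True.

```python
import pandas as pd

# Hypothetical data with one repeated row
df = pd.DataFrame({"Name": ["Lisa", "Lisa", "Kris"], "Age": [31, 31, 28]})

# Default call: a new data frame is returned and df is left unchanged
copy_result = df.drop_duplicates()
rows_before = len(df)

# inplace=True: df itself is modified and nothing is returned
inplace_result = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(len(copy_result), rows_before, rows_after)
print(inplace_result)
```

Assigning the result of an inplace=True call to a variable is a common mistake, because the variable then holds None rather than a data frame.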

Code example

# Import the library
import pandas as pd

# Initialize the data
data = {
    "Name": ["Brad", "Lisa", "Olli", "Lisa", "Kris"],
    "Age": [23, 31, 24, 31, 28],
    "Vaccinated": [True, False, True, False, True],
}

# Create the data frame
df = pd.DataFrame(data)

# Print the data frame
print(df)

In the data frame above, the row at index 3 (Lisa, 31, False) is an exact duplicate of the row at index 1. We use the drop_duplicates() method to remove the repeated row.

# Remove duplicate rows; the first occurrence is kept by default
new_df = df.drop_duplicates()

# Print the new data frame
print(new_df)
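To inspect which rows would be removed before dropping them, one option is the related duplicated() method, which returns a boolean mask. The sketch below rebuilds the same data frame so that it runs on its own; using duplicated() this way is a suggestion and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Brad", "Lisa", "Olli", "Lisa", "Kris"],
    "Age": [23, 31, 24, 31, 28],
    "Vaccinated": [True, False, True, False, True],
})

# True marks a row that repeats an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True, False]

# Show only the rows that drop_duplicates() would remove
print(df[mask])

# Filtering with the inverted mask gives the same result as drop_duplicates()
same = df[~mask].equals(df.drop_duplicates())
print(same)  # True
```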
