
Handling Duplicate Data

Explore how to identify and handle duplicate records in datasets using Python and pandas. Learn techniques for finding duplicates with duplicated(), removing them with drop_duplicates(), and applying custom handling with groupby() to ensure clean and accurate data for analysis.

Strategies for handling duplicate data

There are several strategies for handling duplicate data, including the following:

  • Deduplication: This involves identifying and removing duplicate records from the dataset so that only a single, unique copy of each record is retained. This can be done manually by reviewing the data and identifying duplicate records or automatically using algorithms or tools to detect and remove duplicates. For example, a company with a customer database that contains multiple entries for some customers with slightly different information (e.g., different spellings of their name or address) can use Python to identify and remove duplicate records.
Remove duplicated data
  • Data consolidation: This involves combining duplicate records into a single record so that the data is consistent and accurate. This can be done by selecting a single record as the “golden” record and merging the other duplicate records into it. For example, a hospital can streamline its data management processes and ensure that all information is accurate and up to date if the hospital decides to consolidate all of its databases into a single, unified database. This would help eliminate duplicate records and make it easier to access and update all data in one place.
Combine duplicated data
  • Data reconciliation: This involves reconciling or resolving any inconsistencies or errors in the duplicate records so that the data is consistent and accurate. This can be done by comparing the values in the duplicate records and selecting the most accurate or up-to-date value for each field. For example, a company with a customer order database can ensure that all orders are accurately tracked and fulfilled by reconciling data from all different sales channels. This would help resolve discrepancies and duplicates.
Resolve inconsistencies in data
  • Data standardization: This involves applying consistent rules or standards to the data so that it’s formatted and structured in a consistent and standardized way. This can be done by using algorithms or tools that automatically apply the standards to the data or by manually reviewing and updating the data to conform to the standards. For example, a government agency can have a voter registration database. Employees could enter the data in different forms (e.g., some people use abbreviations, while others spell out entire words). To make it easier to search and analyze data, they can decide to standardize data by implementing rules for how it should be entered (e.g., using entire words, no abbreviations).
Apply consistent rules to data
  • Data governance: This involves establishing and enforcing policies, procedures, and controls for managing and protecting data so that it remains accurate, consistent, and secure. This can be done by defining roles, responsibilities, and processes for data management and implementing tools and technologies to support and enforce the data governance policies. For example, a bank can have strict data governance policies to prevent identity fraud and ensure the accuracy of its customer information. It might require that all data entry be done through standardized forms and processes and that all data be reviewed and approved by a supervisor before entering the database. This would help prevent duplicate records and ensure that the data is accurate and up to date.
Establish procedures to protect data
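As a rough illustration of the consolidation and standardization strategies above, here is a small pandas sketch. The table, column names, and cleanup rules are all hypothetical, not taken from the lesson's dataset:

```python
import pandas as pd

# Hypothetical customer table with duplicate entries in different formats.
customers = pd.DataFrame({
    "name": ["Jane Doe", "jane doe", "Bob Lee"],
    "city": ["NYC", "New York City", "Boston"],
    "last_order": ["2023-01-05", "2023-03-10", "2023-02-01"],
})

# Standardization: apply consistent casing and expand abbreviations.
customers["name"] = customers["name"].str.title()
customers["city"] = customers["city"].replace({"NYC": "New York City"})

# Consolidation: merge duplicates into one "golden" record per customer,
# keeping the most recent value in each field.
golden = (customers.sort_values("last_order")
          .groupby("name", as_index=False)
          .last())
print(golden)
```

After standardization, the two Jane Doe rows become recognizable as duplicates, and consolidation keeps one golden record per customer with her latest values.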

Our focus for this course will be on applying the deduplication strategy using Python.

Finding duplicates

To find duplicates in a DataFrame, we use the duplicated() method.

Python
import pandas as pd
df = pd.read_csv("employees.csv")
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)

Let’s review the code line by line:

  • Lines 1–2: We import the pandas library and load the dataset.

  • Line 3: We use the duplicated() method to find records that are duplicates of earlier records in the DataFrame. We create a new DataFrame called duplicate_rows that contains all records in the original DataFrame, df, that are duplicates. The df.duplicated() method returns a boolean Series indicating whether each record is a duplicate, and the df[df.duplicated()] syntax selects only the records where that Series is True.

Note: By default, duplicated() marks a row as a duplicate only if all the columns of the row match with a previously observed row. This means that the comparison is done on all columns in the DataFrame unless specific columns are selected using the subset parameter.

  • Line 4: We print the duplicate records.
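The subset parameter mentioned in the note, together with keep, changes which rows get flagged. Here is a small sketch with a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben"],
    "city": ["Lahore", "Lahore", "Karachi"],
    "salary": [50000, 52000, 60000],
})

# Full-row comparison: no two rows are identical, so nothing is flagged.
print(df.duplicated().sum())                                     # 0

# Compare only name and city: the second "Ana" row is flagged.
print(df.duplicated(subset=["name", "city"]).sum())              # 1

# keep=False marks every member of a duplicate group, not just later rows.
print(df.duplicated(subset=["name", "city"], keep=False).sum())  # 2
```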

Removing duplicates from a DataFrame

We use the drop_duplicates() method to remove the duplicated records and keep only the unique records. This method returns a new DataFrame with the duplicated records removed.

Python
import pandas as pd
df = pd.read_csv("employees.csv")
df = df.drop_duplicates()
print(df)

Let’s review the code line by line:

  • Lines 1–2: We first import the pandas library and load the dataset.

  • Line 3: We use the drop_duplicates() method to remove any duplicate rows from the DataFrame.

  • Line 4: We print the dataset to confirm the deletion.

From the output, we can see that the duplicate records we identified earlier have been deleted.
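The drop_duplicates() method accepts the same subset and keep parameters as duplicated(), which is useful when rows differ only in a few columns. A sketch with a hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana", "Ana", "Ben"],
    "city": ["Lahore", "Lahore", "Karachi"],
    "salary": [50000, 52000, 60000],
})

# Keep the last occurrence within each (name, city) pair instead of the first.
deduped = df.drop_duplicates(subset=["name", "city"], keep="last")
print(deduped)
```

Here the first Ana row is dropped and her more recent salary of 52000 is retained.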

Removing duplicates using groupby()

We can also remove duplicate records based on specific columns in a dataset using the groupby() method, as shown:

Python
import pandas as pd
df = pd.read_csv("employees.csv")
def handle_duplicates(df):
    df = df.groupby(['CITY'], as_index=False).first()
    return df
df = handle_duplicates(df)
print(df)

Let’s review the code line by line:

  • Lines 1–2: We first import the pandas library and load the dataset.

  • Lines 3–5: We define a custom function handle_duplicates to handle duplicates in the DataFrame. The function takes in a DataFrame as an argument and groups the data by the CITY column using the groupby() method. We set the as_index parameter to False so that CITY is kept as a regular column instead of becoming the index of the result. We then use the first() method to keep only the first record for each group, removing duplicate rows. The modified DataFrame is then returned by the function.

  • Line 6: We call the custom function and pass the DataFrame as an argument.

  • Line 7: We print the dataset to confirm the deletion.

From the output, we can see that we removed duplicate rows in the DataFrame based on the CITY column and returned a DataFrame with unique records based on this column.
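A similar result can often be reached without groupby() by passing the subset parameter to drop_duplicates(). The sketch below uses a hypothetical stand-in for employees.csv:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's employees.csv dataset.
df = pd.DataFrame({
    "NAME": ["Ana", "Ben", "Cal"],
    "CITY": ["Lahore", "Lahore", "Karachi"],
})

# drop_duplicates(subset=...) keeps the first row per CITY, much like
# groupby("CITY").first(), but preserves the original row and column order
# rather than sorting by the grouped column.
unique_cities = df.drop_duplicates(subset=["CITY"])
print(unique_cities)
```

Which approach fits better depends on whether you want the first whole row per group (drop_duplicates) or per-column aggregation within each group (groupby with first() or agg()).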