Deduplication
Learn about the Deduplication pattern in SQL to remove duplicate records and keep only the most relevant data.
We'll cover the following...
When working with large datasets, we often encounter duplicate rows, whether due to data entry errors, system glitches, or imports from multiple sources.
Imagine we’re analyzing recent orders, but multiple entries exist for the same transaction. This not only skews our analysis but can also lead to misleading results in reports, metrics, and summaries. In this lesson, we’ll explore the Deduplication pattern in SQL.
By the end of this lesson, we’ll be able to identify duplicates, understand why they occur, and apply techniques to keep only the desired version of each record. We'll also explore how to choose which row to keep when duplicates vary slightly.
Pattern overview
Category:
Filtering Patterns
Intent:
To remove or filter out duplicate records from a table or result set while preserving only one desired version of each duplicate group.
Motivation:
Duplicate data leads to inaccurate calculations and cluttered reports.
Whether we're calculating the number of unique customers, fetching the most recent order, or displaying unique product listings, eliminating duplicates is often a necessary step. Let’s say we want to list the latest purchase made by each customer.
If we don’t de-duplicate properly, we may show multiple purchases per customer, making the result unreliable.
Also known as:
Distinct filter
Unique row selection
Row Deduplication
Structure
The Deduplication pattern typically follows one of these approaches:
DISTINCT
to remove complete row duplicates.GROUP BY
to group duplicate rows by key and apply aggregation.ROW_NUMBER()
orRANK()
window functions to keep a single preferred row per group.EXISTS
orNOT EXISTS
to exclude rows that are not the latest/first per group.
Keywords
DISTINCT
, GROUP BY
, ROW_NUMBER()
, RANK()
, PARTITION BY
, ORDER BY
, EXISTS
, NOT EXISTS
Problem structure
We use Deduplication when we:
We have duplicate rows in a table or result set.
We want to group by one or more columns and keep only one row per group.
We need to decide which row to keep based on specific criteria (e.g., latest date, lowest price, highest score).
We're joining or importing data and want to prevent duplication in the final result.
Look for keywords like: “remove duplicates,” “keep the latest,” “distinct rows,” “one record per group,” or “filter duplicates” to identify when this pattern applies.
Example use cases
Show a list of customers without repetition.
Retrieve the latest order per customer.
Get the cheapest supplier per product.
Clean up a table after accidental duplicate inserts.
Prepare a dataset for exporting or reporting.
1. Basic Deduplication using DISTINCT
Given the following structure of the Customers
table:
Field | Type |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Where CustomerID
is the primary key and ReferralID
is self-referencing to CustomerID
. The table contains information about customers. Write an SQL query to return a list of distinct customer names from the Customers
table.
Make sure the result doesn't contain any duplicates.
SELECT DISTINCT CustomerNameFROM Customers;
2. Deduplication using GROUP BY
Given the following table structure of Orders
table:
Field | Type |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Where OrderID
is the primary key and CustomerID
is the foreign key from the Customers
table referring to the customer who placed the order. The table contains information about the orders ...