Ultimate Guide to SQL Patterns in Technical Interviews/

...

Deduplication

Learn about the Deduplication pattern in SQL to remove duplicate records and keep only the most relevant data.

We'll cover the following...

Pattern overview
Example use cases
Related patterns
Exercises

When working with large datasets, we often encounter duplicate rows, whether due to data entry errors, system glitches, or imports from multiple sources.

Imagine we’re analyzing recent orders, but multiple entries exist for the same transaction. This not only skews our analysis but can also lead to misleading results in reports, metrics, and summaries. In this lesson, we’ll explore the Deduplication pattern in SQL.

By the end of this lesson, we’ll be able to identify duplicates, understand why they occur, and apply techniques to keep only the desired version of each record. We'll also explore how to choose which row to keep when duplicates vary slightly.

Pattern overview

Category:

Filtering Patterns

Intent:

To remove or filter out duplicate records from a table or result set while preserving only one desired version of each duplicate group.

Motivation:

Duplicate data leads to inaccurate calculations and cluttered reports.

Whether we're calculating the number of unique customers, fetching the most recent order, or displaying unique product listings, eliminating duplicates is often a necessary step. Let’s say we want to list the latest purchase made by each customer.

If we don’t de-duplicate properly, we may show multiple purchases per customer, making the result unreliable.

Also known as:

Distinct filter
Unique row selection
Row Deduplication

Structure

The Deduplication pattern typically follows one of these approaches:

DISTINCT to remove complete row duplicates.
GROUP BY to group duplicate rows by key and apply aggregation.
ROW_NUMBER() or RANK() window functions to keep a single preferred row per group.
EXISTS or NOT EXISTS to exclude rows that are not the latest/first per group.

Keywords

DISTINCT, GROUP BY, ROW_NUMBER(), RANK(), PARTITION BY, ORDER BY, EXISTS, NOT EXISTS

Problem structure

We use Deduplication when we:

We have duplicate rows in a table or result set.
We want to group by one or more columns and keep only one row per group.
We need to decide which row to keep based on specific criteria (e.g., latest date, lowest price, highest score).
We're joining or importing data and want to prevent duplication in the final result.
Look for keywords like: “remove duplicates,” “keep the latest,” “distinct rows,” “one record per group,” or “filter duplicates” to identify when this pattern applies.

Example use cases

Show a list of customers without repetition.
Retrieve the latest order per customer.
Get the cheapest supplier per product.
Clean up a table after accidental duplicate inserts.
Prepare a dataset for exporting or reporting.

1. Basic Deduplication using DISTINCT

Given the following structure of the Customers table:

Press + to interact

Field	Type
`CustomerID`	`int`
`CustomerName`	`varchar(50)`
`Email`	`varchar(50)`
`Phone`	`varchar(15)`
`Address`	`varchar(100)`
`LastLogin`	`date`
`CreatedAt`	`timestamp`
`CustomerTier`	`enum('New','Regular','VIP')`
`ReferralID`	`int`
`LoyaltyPoints`	`int`
`LastPurchaseDate`	`date`
`IsChurnRisk`	`tinyint(1)`

Field	Type
`OrderID`	`int`
`CustomerID`	`int`
`OrderDate`	`date`
`TotalAmount`	`decimal(10,2)`
`ShippedDate`	`date`
`DeliveryStatus`	`enum('Pending','Shipped','Delivered','Cancelled')`
`EmployeeID`	`int`
`CreatedAt`	`timestamp`
`ExpectedDeliveryDate`	`date`
`ActualDeliveryDate`	`date`
`LateDelivery`	`tinyint(1)`
`PaymentMethod`	`enum('Credit Card','PayPal','Cash on Delivery')`
`ReturnCount`	`int`
`FraudRisk`	`tinyint(1)`

Get Started

Aggregation Patterns

Filtering Patterns

Comparison Patterns

Sequencing and Hierarchical Patterns

Transformation Patterns

Wrapping up