Tidy with SQL
Learn how SQL can clean, deduplicate, and reshape data efficiently for analysis.
We’ve explored how Python and pandas give us hands-on, flexible tools for inspecting, shaping, and scaling data, which is ideal for exploration and analysis workflows. But in real-world projects, especially when data lives in big databases or feeds dashboards and reports, we need a tool that works where the data lives.
SQL lets us clean, transform, and prepare data inside the database, so we’re not dragging huge files around or writing extra code to process them. It’s fast, efficient, and built for working at scale.
Cleaning data at the source with SQL is often called “in-database processing.” It’s incredibly efficient because we avoid moving massive amounts of data out of the database for cleaning.
In this lesson, we’ll learn how to clean and organize data using SQL for analysis so it’s ready for whatever insights we need to extract next.
Cleaning data using SQL
Data must be clean, consistent, and analysis-ready to build reliable insights. SQL allows us to perform essential cleaning tasks directly within the database, saving time and improving scalability.
We’ll use a simple customers dataset to demonstrate two common and important SQL-based cleaning steps:
- Fixing inconsistent text values: Use functions like TRIM(), UPPER(), and LOWER() to standardize names and emails, which is critical for accurate filtering, grouping, and joining.
- Handling missing values: Use IS NULL and CASE WHEN to fill gaps with placeholders or flag missing entries, supporting downstream analysis.
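As a quick sketch of the second step, the query below flags missing names and fills missing ages with a placeholder using CASE WHEN. It runs the SQL through Python’s built-in sqlite3 module on a hypothetical in-memory version of the customers table; the table values and the -1/'UNKNOWN' placeholders are illustrative choices, not part of the lesson’s dataset definition.

```python
import sqlite3

# Hypothetical in-memory table loosely mirroring the customers dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Alice ", 28), (2, "bob", None), (3, "CHARLIE", 35), (4, None, 29)],
)

# Fill missing names with a flag value and missing ages with a sentinel.
rows = conn.execute(
    """
    SELECT customer_id,
           CASE WHEN name IS NULL THEN 'UNKNOWN' ELSE name END AS name,
           CASE WHEN age IS NULL THEN -1 ELSE age END AS age
    FROM customers
    """
).fetchall()
print(rows)
# → [(1, 'Alice ', 28), (2, 'bob', -1), (3, 'CHARLIE', 35), (4, 'UNKNOWN', 29)]
```

Whether to fill or merely flag depends on the analysis: a sentinel like -1 keeps the row usable in aggregates, while a flag column preserves the fact that the value was missing.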
Fun fact: Spotting a single typo in a million-row dataset with SQL feels like finding a needle in a haystack, but with a powerful magnet.
Let’s illustrate this with a practical example. Below is a simplified customer table that we’ll gradually clean using SQL.
Sample Dataset: Customers
Customer_id | Name      | Age  | Email                | Total_spend | Join_date
----------- | --------- | ---- | -------------------- | ----------- | ----------
1           | 'Alice '  | 28   | 'ALICE@example.COM'  | 300         | 2022-01-15
2           | 'bob'     | NULL | NULL                 | -50         | 2021-09-10
3           | 'CHARLIE' | 35   | 'charlie@example.com'| 800         | 2020-07-22
4           | NULL      | 29   | 'dan@example.com '   | 450         | 2022-11-01
1. Fixing inconsistent values
Inconsistent casing and extra spaces may seem small, but they can seriously throw off the analysis. Think of trying to group by name when we have "Alice", "alice", and " Alice "; SQL treats them as different values.
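We can see this directly. The snippet below (a minimal sketch using Python’s sqlite3 module and a throwaway one-column table) groups the three spellings as-is, then again after normalizing with TRIM and UPPER:

```python
import sqlite3

# Throwaway table holding the same name stored three different ways.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("Alice",), ("alice",), (" Alice ",)])

# Grouping the raw values yields three separate groups.
raw_groups = conn.execute(
    "SELECT name, COUNT(*) FROM customers GROUP BY name"
).fetchall()
print(len(raw_groups))  # → 3

# Normalizing first collapses them into a single group.
clean_groups = conn.execute(
    "SELECT TRIM(UPPER(name)) AS name, COUNT(*) "
    "FROM customers GROUP BY TRIM(UPPER(name))"
).fetchall()
print(clean_groups)  # → [('ALICE', 3)]
```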
Let’s look at three SQL functions that help fix these issues:
- TRIM(): Removes unwanted spaces from both ends of a string. Useful when values are accidentally saved with leading or trailing blank characters.
- UPPER(): Converts all letters in a string to uppercase. This helps standardize fields like names for consistent grouping or comparisons.
- LOWER(): Converts all letters in a string to lowercase. This is especially helpful for fields like emails, where case doesn’t matter but consistency does.
Together, these functions clean up the text data and make it more analysis-friendly.
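Before applying them to the table, here is a one-line sanity check of what each function returns on a messy value. This sketch assumes SQLite’s implementations of TRIM, UPPER, and LOWER (behavior is the same in most SQL dialects for plain ASCII text):

```python
import sqlite3

# Evaluate the three functions on hypothetical messy inputs.
conn = sqlite3.connect(":memory:")
row = conn.execute(
    "SELECT TRIM('  Alice  '), UPPER('bob'), LOWER('ALICE@example.COM')"
).fetchone()
print(row)  # → ('Alice', 'BOB', 'alice@example.com')
```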
UPDATE customers
SET
  name = TRIM(UPPER(name)),
  email = TRIM(LOWER(email))
WHERE name IS NOT NULL OR email IS NOT NULL;

SELECT customer_id, name, email
FROM customers;
Line 1: UPDATE customers targets the customers table for updating data with the UPDATE command.
Line 2: SET begins the list of columns to update and their new values.