
Tidy with SQL

Learn how SQL can clean, deduplicate, and reshape data efficiently for analysis.

We’ve explored how Python and pandas give us hands-on, flexible tools for inspecting, shaping, and scaling data, which is ideal for exploration and analysis workflows. But in real-world projects, especially when data lives in big databases or flows through visualizations and pipelines (a data pipeline is an automated system that moves and transforms raw data through various stages until it’s ready for analysis or use), we often need something faster and closer to the source. That’s where SQL comes in.

SQL lets us clean, transform, and prepare data inside the database, so we’re not dragging huge files around or writing extra code to process them. It’s fast, efficient, and built for working at scale.

Cleaning data at the source with SQL is often called “in-database processing.” It’s incredibly efficient because we avoid moving massive amounts of data out of the database for cleaning.

In this lesson, we’ll learn how to use SQL to clean and organize data so it’s ready for whatever insights we need to extract next.

Cleaning data using SQL

Data must be clean, consistent, and analysis-ready to build reliable insights. SQL allows us to perform essential cleaning tasks directly within the database, saving time and improving scalability.

We’ll use a simple customers dataset to demonstrate two common and important SQL-based cleaning steps:

  1. Fixing inconsistent text values: Use functions like TRIM(), UPPER(), and LOWER() to standardize names and emails, which is critical for accurate filtering, grouping, and joining.

  2. Handling missing values: Use IS NULL and CASE WHEN, either by filling gaps with placeholders or flagging missing entries to support downstream analysis.
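As a quick preview of the second step, here is a sketch (assuming a `customers` table with nullable `name` and `email` columns, as in the sample below) of how `IS NULL`, `CASE WHEN`, and `COALESCE` can fill or flag gaps:

```sql
-- Fill missing names with a placeholder and flag rows with a missing email.
-- COALESCE(x, y) returns x unless x is NULL, in which case it returns y.
SELECT
  customer_id,
  COALESCE(name, 'UNKNOWN') AS name_filled,
  CASE WHEN email IS NULL THEN 1 ELSE 0 END AS email_missing
FROM customers;
```

Filling with a placeholder keeps every row in the result, while the `email_missing` flag lets downstream analysis decide how to treat those rows.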

Fun fact: Spotting a single typo in a million-row dataset with SQL feels like finding a needle in a haystack, but with a powerful magnet.

Let’s illustrate this with a practical example. Below is a simplified customer table that we’ll gradually clean using SQL.

Sample Dataset: Customers

| customer_id | name | age | email | total_spend | join_date |
| --- | --- | --- | --- | --- | --- |
| 1 | 'Alice ' | 28 | 'ALICE@example.COM' | 300 | 2022-01-15 |
| 2 | 'bob' | NULL | NULL | -50 | 2021-09-10 |
| 3 | 'CHARLIE' | 35 | 'charlie@example.com' | 800 | 2020-07-22 |
| 4 | NULL | 29 | 'dan@example.com ' | 450 | 2022-11-01 |

1. Fixing inconsistent values

Inconsistent casing and extra spaces may seem small, but they can seriously throw off the analysis. Think of trying to group by name when we have "Alice", "alice", and " Alice "; SQL treats them as different values.
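To see the problem concretely, here is a sketch (column names assumed from the sample table) of how grouping behaves before and after cleaning:

```sql
-- Before cleaning: 'Alice', 'alice', and ' Alice ' each form their own group.
SELECT name, COUNT(*) AS n
FROM customers
GROUP BY name;

-- Grouping on the cleaned value collapses them into a single group.
SELECT TRIM(UPPER(name)) AS clean_name, COUNT(*) AS n
FROM customers
GROUP BY clean_name;
```

Note that MySQL allows grouping by a column alias like `clean_name`; in stricter SQL dialects we may need to repeat the expression, as in `GROUP BY TRIM(UPPER(name))`.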

Let’s look at three SQL functions that help fix these issues:

  • TRIM(): Removes unwanted spaces from both ends of a string. Useful when values are accidentally saved with leading or trailing blank characters.

  • UPPER(): Converts all letters in a string to uppercase. This helps standardize fields like names for consistent grouping or comparisons.

  • LOWER(): Converts all letters in a string to lowercase. This is especially helpful for fields like emails, where case doesn’t matter but consistency does.

Together, these functions clean up the text data and make it more analysis-friendly.
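Before applying them to a whole table, we can try these functions on literal values; the comments show the value each expression produces:

```sql
SELECT
  TRIM('  Alice  ')          AS trimmed,  -- 'Alice'
  UPPER('bob')               AS uppered,  -- 'BOB'
  LOWER('ALICE@example.COM') AS lowered;  -- 'alice@example.com'
```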

MySQL 8.0:

```sql
UPDATE customers
SET
  name = TRIM(UPPER(name)),
  email = TRIM(LOWER(email))
WHERE name IS NOT NULL OR email IS NOT NULL;

SELECT customer_id, name, email
FROM customers;
```
  • Line 1: UPDATE customers targets the customers table for modification using the UPDATE command.

  • Line 2: SET begins ...