...

Patch the Gaps

Learn how to handle missing data, duplicates, and outliers to make the dataset more reliable.

As data analysts, one of our biggest challenges is making sense of imperfect data. It’s like trying to build a story from a puzzle where some pieces are missing, some repeat, and others don’t even fit. Sure, we can still piece something together, but the result might be misleading or incomplete.

That’s what messy data feels like in analysis.

So far, we’ve explored why data often arrives in imperfect form and how to load and explore it using pandas. Now it’s time for the hands-on work of making it usable. In this lesson, we’ll tackle three common problems that can derail analysis:

  • Missing values

  • Duplicate records

  • Outliers

Each can distort our results in subtle but significant ways, so we’ll learn how to detect them, decide what to do, and apply practical fixes using Python.

Missing data

Sometimes, data just… disappears. Whether we’re looking at logs, event trackers, or customer databases, it’s common to find fields left blank or marked as NULL. Our job as analysts is to spot these gaps and decide the best way to handle them, whether that means filling them in, dropping them, or flagging them for further review.
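As a quick preview, these three options map onto simple pandas operations. The sketch below uses placeholder column names (Amount, CustomerID) rather than a specific dataset, and each line illustrates one option in isolation, not a sequence to run as-is:

# Fill: replace missing amounts with the column median
df["Amount"] = df["Amount"].fillna(df["Amount"].median())

# Drop: discard rows where the customer is unknown
df = df.dropna(subset=["CustomerID"])

# Flag: keep the rows, but mark them for further review
df["AmountMissing"] = df["Amount"].isna()

We'll look at each of these approaches in more detail throughout the lesson.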

Identify missing values

We begin by scanning the dataset for missing entries. In pandas, one of the most effective ways to do this is by combining the isna() method with sum(). This quickly shows us how many values are missing in each column, helping us spot potential issues early on.

Informational note: Missing values often appear during ingestion; think of corrupted CSVs, broken API calls, or optional fields in web forms. Knowing where they’re common helps build preventive checks upstream.
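For instance, a lightweight check right after ingestion can surface gaps before they reach analysis. This is a minimal sketch; the function name and the list of required columns are assumptions for illustration:

import pandas as pd

def report_missing(df, required_columns):
    # Count missing entries only in the columns we treat as mandatory
    missing = df[required_columns].isna().sum()
    return missing[missing > 0]

# Hypothetical usage on a freshly ingested file
# issues = report_missing(pd.read_csv("incoming_batch.csv"), ["TransactionID", "CustomerID"])
# print(issues)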

In the following example, we work with a dataset containing transaction-level data, including fields such as TransactionID, CustomerID, Quantity, TotalAmount, Product, and Date. Our goal is to identify the missing value count in each column.

import pandas as pd

# Load the transaction-level dataset
df = pd.read_csv("customer_transactions.csv")

# Count missing values in each column
print(df.isna().sum())
  • The last line, print(df.isna().sum()), identifies missing values in the DataFrame. Let’s break it down:

    • df.isna() returns a DataFrame of the same size as df, with True for missing values (NaN) and False otherwise.

    • sum() is called on the DataFrame returned by df.isna(). It adds up the True values (treated as 1) in each column, effectively counting the number of missing values per column in the DataFrame.

This output reveals that Quantity and TotalAmount contain missing values that must be addressed to maintain the validity of transactional metrics.
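Before deciding how to fill these gaps, it often helps to inspect the affected rows themselves. A minimal sketch, assuming the column names listed above:

# Rows where either Quantity or TotalAmount is missing
missing_rows = df[df[["Quantity", "TotalAmount"]].isna().any(axis=1)]
print(missing_rows.head())

# Share of affected rows, to gauge how much data is at stake
print(len(missing_rows) / len(df))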

Basic imputation strategies

When data is missing, we don’t always want to discard entire rows. Imputation lets us fill in values based on patterns or logical defaults. Each imputation technique makes assumptions about the data, so it’s important to choose the method that best fits our context. Below are five commonly used methods:

  • Median imputation: Instead of dropping rows with missing values, we fill them in using the median of each ...