Patch the Gaps
Learn how to handle missing data, duplicates, and outliers to make the dataset more reliable.
One of our fundamental challenges as data scientists is working with imperfect data. Imagine trying to solve a jigsaw puzzle, but some pieces are missing, others are duplicated, and a few don’t belong. We could still try to make sense of the picture, but what we end up with might be misleading or incomplete.
That’s what working with messy data feels like.
So far, we’ve explored why data often arrives in imperfect form and how to load and explore it using pandas. Now it’s time for the hands-on work of making it usable. This lesson focuses on three major issues that break down analysis:
Missing values
Duplicates
Outliers
Fixing these issues isn’t about filling in gaps or deleting data—it’s about making informed, careful decisions that balance completeness with accuracy. By mastering these skills, we ensure the quality and reliability of the datasets that form the backbone of our analyses and predictive models.
Missing data
Missing data can compromise the accuracy of results, particularly when key variables are affected. It is important to identify these gaps and apply suitable imputation strategies.
Identify missing values
We begin by scanning the dataset for missing entries. One of the most effective ways to do this in pandas is combining the isna()
method with sum()
. This shows how many values are missing in each column, helping us spot potential issues early on.
In the following example, we work with a transaction-level dataset containing fields such as TransactionID
, CustomerID
, Quantity
, TotalAmount
, Product
, and Date
. Our goal is to identify the missing value count in each column.
import pandas as pddf = pd.read_csv("customer_transactions.csv")print(df.isna().sum())
Line 4: This line helps identify missing values in a DataFrame. Let’s break it down:
df.isna()
returns a DataFrame of the same size asdf
, withTrue
for missing values (NaN) andFalse
otherwise. It helps identify where data is incomplete in our dataset.sum()
is called on the DataFrame returned bydf.isna()
. It adds up theTrue
values (treated as 1) in each column, effectively counting the number of missing values per column in the DataFrame.
This output reveals that Quantity
and TotalAmount
contain missing values that must be addressed to maintain the validity of transactional metrics.
Basic imputation strategies
When data is missing, we don’t always want to discard entire rows. Imputation lets us fill in values based on patterns or logical defaults. Each imputation technique makes assumptions about the data, so it’s important to choose the method that best fits our context. Below are five commonly used methods:
Median imputation: Instead of dropping rows with missing values, we fill them in using the median of each column. This approach helps preserve the dataset while reducing the impact of outliers.
TransactionID,CustomerID,Quantity,TotalAmount,Product,Date1001,C001,2,40.0,Wireless Mouse,2024-01-151002,C002,1,25.0,USB-C Cable,2024-01-161003,C003,,90.0,Bluetooth Speaker,2024-01-171004,C004,3,,Laptop Stand,2024-01-181005,C005,2,9999.0,WebCam Pro,2024-01-191005,C005,2,9999.0,WebCam Pro,2024-01-191006,C006,1,20.0,HDMI Cable,2024-01-201007,C007,2,50.0,Portable Charger,2024-01-211008,C008,,35.0,Mouse Pad,2024-01-221009,C009,3,75.0,Wireless Keyboard,2024-01-231010,C010,1,15.0,Stylus Pen,2024-01-241011,C011,2,45.0,Screen Protector,2024-01-251012,C012,4,88.0,Laptop Bag,2024-01-261013,C013,3,70.0,USB Hub,2024-01-271014,C014,,60.0,External Drive,2024-01-281015,C015,2,,Smartphone Stand,2024-01-291016,C016,3,65.0,Wireless Charger,2024-01-301017,C017,1,10.0,USB Light,2024-02-011018,C018,2,55.0,Webcam,2024-02-021019,C019,4,120.0,Headset,2024-02-031020,C020,2,80.0,Mechanical Keyboard,2024-02-041021,C021,3,95.0,Ergonomic Mouse,2024-02-051022,C022,1,,Phone Holder,2024-02-061023,C023,2,9999.0,Gaming Headset,2024-02-071024,C024,3,110.0,LED Monitor,2024-02-081025,C025,2,30.0,USB Extension,2024-02-091026,C026,,45.0,Laptop Riser,2024-02-101027,C027,1,15.0,Mouse Bungee,2024-02-111028,C028,1,22.0,Cable Organizer,2024-02-121029,C029,2,55.0,Portable SSD,2024-02-131030,C030,1,60.0,Monitor Arm,2024-02-14
Lines 1–2:
fillna()
replaces missing values with the ...