Why Data Needs a Cleanup
Learn why cleaning data is critical and how to identify common data issues.
Data collected from real-world sources is seldom neat or ready to use. It’s usually messy, with missing pieces, mixed-up labels, odd formats, and even duplicates. It’s like trying to read a book with the pages out of order and the words smudged. We need to clean it up before it can truly make sense.
Data professionals devote a significant portion of their time to cleaning and preparing data. While it may not be the most appealing part of data analysis, it’s absolutely essential because even small errors can undermine the integrity of an entire analysis.
In this lesson, we’ll explore the main sources that make data dirty, the key dimensions of data quality issues we encounter, and different techniques to resolve them.
What makes data dirty?
Dirty data isn’t just an abstract concept; it refers to information that fails to accurately or consistently reflect reality. Understanding the sources of dirty data is the first step toward addressing them.
1. Human entry errors
Many datasets originate from manual inputs, web forms, spreadsheets, CRMs, or even handwritten records. These entry points are highly susceptible to small but impactful mistakes.
Misspellings and typos: A single typo like “Californa” instead of “California” can prevent accurate grouping or aggregation, leading to flawed summaries.
Transposed numbers: Typing 12,000 instead of 21,000 changes the meaning of the data entirely, distorting totals and misleading any analysis based on those figures.
Inconsistent labeling: Values such as “NY,” “N.Y.,” and “New York” are interpreted the same by humans, but handled as different values by systems.
As the dataset grows, these errors accumulate and erode the trustworthiness of any insights drawn from it, so fixing them is usually a top priority in any cleaning process (the sketch below shows one way to handle them in pandas).
Informational note: The bigger the dataset, the more likely these small errors will accumulate and cause real damage.
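To make this concrete, here is a minimal pandas sketch of standardizing inconsistent labels like the ones above. The DataFrame, column names, and the variant-to-canonical mapping are hypothetical, chosen only to mirror the examples in this lesson.

```python
import pandas as pd

# Hypothetical entries containing the kinds of mistakes described above.
df = pd.DataFrame({
    "state": [" California", "Californa", "NY", "N.Y.", "new york"],
    "amount": [21000, 12000, 15000, 15000, 9000],
})

# Normalize whitespace and casing so trivially different spellings collapse.
df["state"] = df["state"].str.strip().str.title()

# Map known typos and abbreviations to a single canonical label.
canonical = {"Californa": "California", "Ny": "New York", "N.Y.": "New York"}
df["state"] = df["state"].replace(canonical)

# Grouping now produces one row per real state instead of one per variant.
print(df.groupby("state")["amount"].sum())
```

Normalizing case and whitespace first shrinks the number of variants the mapping has to cover, which keeps the cleanup rules short and easy to review.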
2. System and integration issues
Data often flows across multiple systems, such as databases, applications, and APIs, each with its own rules and expectations. During this exchange, things can go wrong in subtle but significant ways. Integration issues are especially common when systems weren’t designed to work together.
Schema mismatches: One system may store dates as text strings, while another expects them in the standardized YYYY-MM-DD format. These inconsistencies can lead to parsing errors or incorrect sorting and filtering. To fix this, we’ll use pd.to_datetime() to convert date strings into proper datetime objects in pandas, ensuring consistency across systems (see the sketch after this list).
Encoding problems: Special characters can break during transfers. A word like café might turn into cafÃ© due to mismatched character encoding, making string operations unreliable.
Repurposed fields: Sometimes a column that once held ZIP codes is quietly reassigned to store entirely different data. Without updates to the schema or documentation, this change introduces confusion and inaccurate interpretations.
Issues like these often remain hidden until analysis begins. By then, the damage is harder to trace and more difficult to correct.
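As an illustration of resolving a schema mismatch, here is a minimal sketch that parses dates from two hypothetical systems, each with its own text format, into proper datetime objects with pd.to_datetime(). The DataFrames, column names, and date formats are assumptions made for the example.

```python
import pandas as pd

# Hypothetical exports from two systems that disagree on how dates are written.
system_a = pd.DataFrame({"order_id": [101, 102],
                         "order_date": ["2024-03-01", "2024-03-05"]})
system_b = pd.DataFrame({"order_id": [201, 202],
                         "order_date": ["03/07/2024", "03/09/2024"]})

# Parse each source with its declared format; unparseable values become NaT
# instead of silently remaining text.
system_a["order_date"] = pd.to_datetime(system_a["order_date"],
                                        format="%Y-%m-%d", errors="coerce")
system_b["order_date"] = pd.to_datetime(system_b["order_date"],
                                        format="%m/%d/%Y", errors="coerce")

# With both sides on a real datetime dtype, combining and sorting are safe.
orders = pd.concat([system_a, system_b], ignore_index=True)
print(orders.sort_values("order_date"))
```

Declaring the format explicitly for each source also documents the assumption, so a later format change fails loudly instead of sorting incorrectly.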
3. Automated collection noise
Even when data is collected without human input, problems still arise. Automation boosts efficiency but doesn’t guarantee quality. Whether it’s sensors, scraping tools, or OCR systems, each method introduces its own kinds of noise.
Sensor glitches: A faulty sensor might report impossible spikes, like a temperature jumping from 22°C to 2,000°C, due to hardware or calibration errors.
OCR misreads: Optical character recognition might confuse an “8” for a “B” when reading printed text, introducing errors into fields like invoice numbers or product codes.
Malformed API responses: Sometimes an API call returns incomplete data, duplicated records, or outdated information. Without proper validation, these flaws end up in the dataset.
Automated pipelines save time, but even the subtle, infrequent errors they introduce can significantly distort analysis results; a simple validity check, like the one sketched below, catches many of them early.
Tip: Automation saves time, but the collected data should always be verified.
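Here is a minimal sketch of one such check, flagging sensor readings that fall outside a plausible physical range. The readings, the series name, and the -50 to 60 °C bounds are assumptions made for the example.

```python
import pandas as pd

# Hypothetical temperature readings, including an impossible 2,000 °C spike.
readings = pd.Series([21.8, 22.1, 22.4, 2000.0, 22.0], name="temp_c")

# Flag values outside an assumed plausible range (-50 to 60 °C) rather than
# deleting them outright, so the glitch can still be investigated.
valid = readings.between(-50, 60)
print(readings[~valid])          # the suspect readings

# Mask the out-of-range values so downstream statistics ignore them.
cleaned = readings.where(valid)
print(cleaned.mean())            # unaffected by the spike
```

Flagging and masking, rather than dropping rows immediately, preserves the evidence needed to diagnose whether the sensor or the pipeline is at fault.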
4. Metadata and context gaps
Some data issues don’t lie in the values themselves, but in missing information about what those values mean. Without that context we are left guessing, and guesswork has no place in data-driven decision-making.
Missing units: A value of “100” might represent temperature, revenue, or satisfaction score. Without units or context, we risk misinterpretation.
Undefined labels: A status field ...