...


Why Data Needs a Clean Up

Learn why cleaning data is critical and how to identify common data issues.

Data collected from real-world sources is seldom neat or ready to use. It usually comes messy, with missing pieces, mixed-up labels, weird formats, and duplicates. It’s like trying to read a book with pages out of order and some words smudged—we need to clean it up before it makes sense.

Cleaning data isn’t just about correcting errors. It’s about transforming raw, unreliable data into a trustworthy foundation that can drive meaningful insights and decisions.

Data professionals report spending 60 to 80% of their time cleaning and preparing data. While it may not be the most exciting part of data science, it’s essential because even minor errors can compromise the integrity of an entire analysis.

This lesson will explore the main sources that make data dirty, the key dimensions of data quality issues we encounter, and why effective cleaning is crucial for reliable analysis.

What makes data dirty?

Dirty data isn’t just an abstract concept—it refers to information that fails to accurately or consistently reflect reality. For data scientists, this is more than just a technical issue. Poor-quality data undermines the integrity of our analysis, misguides models, and leads to decisions that lack a solid foundation.

To address data quality issues effectively, it’s crucial to identify their origins. Understanding the sources of dirty data is the first step toward resolving it. Let’s explore the different sources of dirty data.

1. Human entry errors

Many datasets originate from manual inputs, such as web forms, spreadsheets, CRMs, or handwritten records. These entry points are highly susceptible to small but impactful mistakes.

  • Misspellings and typos: A typo like “Californa” instead of “California” can prevent accurate grouping or aggregation, leading to flawed summaries.

  • Transposed numbers: Typing 12,000 instead of 21,000 changes the meaning of the data entirely, distorting totals and misleading any analysis based on those figures.

  • Inconsistent labeling: Values such as “NY,” “N.Y.,” and “New York” are interpreted the same by humans but handled differently by systems.
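Entry errors like these are often handled with a normalization step. The sketch below is a minimal illustration in plain Python; the variant spellings and the `CANONICAL` lookup table are hypothetical examples, not part of any real dataset or library.

```python
# Hypothetical lookup table mapping known label variants (and a known typo)
# to a single canonical form.
CANONICAL = {
    "ny": "New York",
    "n.y.": "New York",
    "new york": "New York",
    "californa": "California",  # typo mapped to the correct spelling
    "california": "California",
}

def normalize_label(raw: str) -> str:
    """Map a raw label to its canonical form; leave unknown values unchanged."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

labels = ["NY", "N.Y.", "New York", "Californa"]
print([normalize_label(s) for s in labels])
```

After normalization, all four variants group together, so aggregations by state produce one row instead of four.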

These errors accumulate as a dataset grows, steadily eroding the trustworthiness of any insights drawn from it, so fixing them is usually a top priority in any cleaning process.

2. System and integration issues

Data often flows across multiple systems—databases, applications, APIs—each with rules and expectations. During this exchange, things can go wrong in subtle but significant ways. Integration issues are especially common when systems weren’t designed to work together.

  • Schema mismatches: One system may store dates as text strings, while another expects them in the standardized YYYY-MM-DD format. These inconsistencies lead to parsing errors or incorrect sorting and filtering.

  • Encoding problems: Special characters can break down during transfers. A word like café might become cafÃ© when UTF-8 bytes are mis-decoded with the wrong character encoding, making string operations unreliable.

  • Repurposed fields: Sometimes a column that once held ZIP codes is quietly reassigned to store entirely different data. Without updates to the schema or documentation, this change introduces confusion and inaccurate interpretations.
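The encoding problem above can sometimes be reversed. A minimal sketch, assuming the common case where UTF-8 bytes were mistakenly decoded as Latin-1:

```python
def repair_mojibake(text: str) -> str:
    """Try to reverse UTF-8 text that was mis-decoded as Latin-1.

    If the round-trip fails, the string was probably not mojibake,
    so it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

print(repair_mojibake("cafÃ©"))   # recovers "café"
print(repair_mojibake("hello"))   # unchanged
```

This only works when the mis-decoding followed that exact Latin-1 path; other encoding mix-ups need different (or library-assisted) repairs.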

Issues like these often remain hidden until analysis begins. By then, the damage is harder to trace and more difficult to correct.

3. Automated collection of noise

Even when data is collected without human input, problems still arise. Automation boosts efficiency, but it doesn't guarantee quality: sensors, scraping tools, and OCR systems each introduce their own kinds of noise.

  • Sensor glitches: A faulty sensor might report impossible spikes—like a temperature jumping from 22°C to 2,000°C—due to hardware or calibration errors.

  • OCR misreads: Optical character recognition might confuse an “8” for a “B” when reading printed text, introducing errors into fields like invoice numbers or product codes.

  • Malformed API responses: Sometimes an API call returns incomplete data, duplicated records, or outdated information. Without proper validation, these flaws end up in your dataset.
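A common defense against sensor glitches is a plausibility check that flags readings outside a physically reasonable range. A minimal sketch; the temperature bounds below are assumed values for illustration, not a standard:

```python
def split_readings(readings, lo=-40.0, hi=85.0):
    """Separate plausible temperature readings from suspected glitches.

    The default bounds (-40°C to 85°C) are an assumed operating range
    for a typical sensor; real pipelines would use device-specific limits.
    """
    plausible = [v for v in readings if lo <= v <= hi]
    glitches = [v for v in readings if not lo <= v <= hi]
    return plausible, glitches

ok, bad = split_readings([21.5, 22.0, 2000.0, 21.8])
print(ok)   # the 2000.0 spike is excluded
print(bad)  # flagged for inspection rather than silently dropped
```

Flagging rather than deleting keeps the glitches available for later diagnosis, such as detecting a failing sensor.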

Automated pipelines save time, but even the subtle, infrequent errors they introduce can significantly distort analysis results.

4. Metadata and context gaps

Some data issues don’t lie in the values themselves, but in the lack of information about what those ...