Navigate Data Wrangling
Explore the key stages of data wrangling including collecting, cleaning, transforming, enriching, and validating data. Understand how to handle common data issues and apply techniques like deduplication, imputation, and outlier detection in Google Sheets. This lesson helps you prepare raw, disorganized data into reliable, structured datasets ready for insightful analysis.
As data analysts, we might wish for clean, consistent, and ready-to-explore datasets. But more often than not, we face a very different reality. Columns may be misnamed or missing, values duplicated or out of range, formats inconsistent across files, and data scattered across systems, APIs, or spreadsheets.
Before we can explore trends, create visualizations, or run any analysis, we need to make sense of the chaos in our data. This is where data wrangling comes in. It’s the essential process of turning raw, messy data into clean, structured, and usable information.
What is data wrangling?
Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It involves taking messy, unorganized, or incomplete datasets and transforming them into something usable. Whether we are removing duplicate entries, fixing incorrect values, merging datasets, or creating new calculated fields, all of it falls under the broad umbrella of data wrangling.
The goal of data wrangling isn’t just about cleaning up data for the sake of neatness. It’s about preparing data so that we can extract reliable insights, build meaningful charts, and support sound decision-making.
Data wrangling vs data cleaning: What’s the difference?
One common point of confusion is data wrangling vs data cleaning. While they’re closely related, they’re not the same.
Data cleaning
Focuses on fixing errors
Removes duplicates
Handles missing or invalid values
Standardizes formats
Data wrangling
Includes data cleaning
Plus transformation, enrichment, and validation
Prepares data end-to-end for analysis or modeling
Think of it this way: Cleaning is a step. Wrangling is the entire journey.
Why data wrangling matters in the data analysis pipeline
We can think of the data analysis pipeline as a relay race. If one stage hands off bad data to the next, the whole process falters. Without thorough data wrangling:
Our analysis might be based on incorrect assumptions.
Visualizations can become misleading or outright wrong.
The quality of our insights directly depends on the quality of our data. In fact, history offers a famous cautionary tale.
In 1999, NASA lost the $125 million Mars Climate Orbiter because one engineering team worked in imperial units (pound-seconds) while another used metric units (newton-seconds).
This wasn’t a coding error. It was a data standardization failure. Even in day-to-day analysis, this example reminds us why careful cleaning, unit alignment, and validation are critical parts of data wrangling. They don’t just clean up a dataset, they protect the decisions built on it.
Where does data wrangling fit with ETL pipelines?
In production systems, data wrangling often happens inside ETL pipelines.
ETL stands for:
Extract data from sources
Transform it into usable formats
Load it into a database or warehouse
ETL pipelines automate data wrangling at scale, especially in business intelligence and analytics systems.
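Conceptually, an ETL pass fits in a few lines. The Python sketch below uses hypothetical inline data: it extracts rows from CSV text, transforms the amounts from strings into numbers, and loads them into an in-memory SQLite table standing in for a warehouse.

```python
import csv, io, sqlite3

# Extract: read raw CSV text (inlined here; in practice, a file or API).
raw = "order_id,amount\n1,10.50\n2,8.25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: convert amounts from strings to floats.
for r in rows:
    r["amount"] = float(r["amount"])

# Load: insert into a SQLite table (a stand-in for a real warehouse).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (:order_id, :amount)", rows)

total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 18.75
```

Real pipelines add scheduling, error handling, and incremental loads, but the extract–transform–load shape stays the same.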
Data wrangling workflow
Think of the data wrangling workflow as a reliable guide for taking any dataset from disorganized to analysis-ready. Let’s break it down into a clear, step-by-step process:
Each step builds on the last, and skipping any one can cause problems later on. Let’s explore what happens at each stage and why it matters.
1. Ingest: Collecting the raw materials
Before we can clean or analyze anything, we need to gather raw data, and that’s what ingestion is all about. This step brings data together from different sources into one place, where we can begin working with it.
Note: We’ve already explored data collection in the previous chapter, so some of this may feel familiar. Here, we’re taking it a step further: bringing that collected data into our analysis environment so we can start preparing it.
Data ingestion can include:
Gathering records from structured data sources.
Loading files like CSVs or Excel sheets from local or shared storage.
Accessing data directly from web sources.
Importing semi-structured formats like XML.
Bringing data from various cloud storage locations.
The goal is to gather all relevant data, no matter the format or location, into our working environment for analysis.
Example
Let’s say we’re analyzing customer purchases. We might start by gathering order data from a central repository. Next, we download product details from a CSV file maintained by the inventory team. To get the latest prices, we access live pricing data from an online source. Once we’ve gathered all three sources, we combine them into a single coherent dataset. Only after this ingestion step can we begin to clean, merge, and make sense of the data.
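As a rough illustration, the Python sketch below combines three hypothetical sources — order records, a product CSV, and a dictionary of live prices — into one dataset. All names and values are invented.

```python
import csv, io

# Hypothetical sources: orders from a central system, product details
# from a CSV file, and live prices from an API response (inlined here).
orders = [{"order_id": "O1", "product_id": "P1"},
          {"order_id": "O2", "product_id": "P2"}]

product_csv = "product_id,name\nP1,Mug\nP2,Notebook\n"
products = {r["product_id"]: r["name"]
            for r in csv.DictReader(io.StringIO(product_csv))}

live_prices = {"P1": 9.99, "P2": 4.50}  # e.g., parsed from a JSON feed

# Combine the three sources into one dataset ready for wrangling.
combined = [{**o, "name": products[o["product_id"]],
             "price": live_prices[o["product_id"]]} for o in orders]
print(combined[0])  # {'order_id': 'O1', 'product_id': 'P1', 'name': 'Mug', 'price': 9.99}
```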
Some companies set up automated data ingestion pipelines that pull, clean, and store fresh data every few minutes, like a self-cleaning fridge that restocks itself with clean groceries.
Challenges during ingestion
While ingestion may sound straightforward, it often comes with technical and logistical hurdles that we need to plan for.
Different formats: Data might come from structured sources, CSV files, or web-based data feeds, each needing its own handling and alignment.
Access issues: Some sources require credentials, logins, or specific permissions to connect and fetch data.
Big data loads: Massive datasets can overwhelm memory or crash tools. Processing data in smaller chunks can help manage it safely.
Real-time delays: Live data (like from sensors or web feeds) may be incomplete, late, or out of order, requiring careful handling and timestamp syncing.
2. Clean: Making the data usable
Cleaning is the step where we make the raw, ingested data trustworthy. This means correcting errors, removing duplicates, handling missing values, and fixing formatting issues. Without clean data, even the best analysis can give misleading or incorrect results.
In this step, we focus on:
Filling in missing values (e.g., blanks, placeholders).
Removing or flagging duplicates.
Standardizing formats (e.g., dates, capitalization, currency).
Fixing invalid or out-of-range values.
The goal is to make sure the data is consistent, complete, and accurate, so it won’t introduce bias or confusion later in the analysis workflow.
Example
Imagine we’re working with a customer transaction dataset. During the cleaning phase, we might start by filling missing values in the CustomerAge column using the median age. Next, we drop duplicate rows where the same TransactionID appears more than once. We then convert all purchase_date entries into a standard date format to ensure consistency. Finally, we remove any records where the total amount is negative, since these likely indicate data entry errors. By the end of this step, our dataset is no longer messy; it’s structured, reliable, and ready for transformation.
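The cleaning steps above can be sketched in plain Python. Everything here is hypothetical sample data; the point is the order of operations: fill, deduplicate, standardize, then filter.

```python
from statistics import median
from datetime import datetime

# Hypothetical messy transactions: a missing age, a duplicate ID,
# mixed date formats, and a negative amount (likely an entry error).
rows = [
    {"id": "T1", "age": 34,   "date": "2024-01-05", "amount": 120.0},
    {"id": "T1", "age": 34,   "date": "2024-01-05", "amount": 120.0},  # duplicate
    {"id": "T2", "age": None, "date": "05/01/2024", "amount": 80.0},
    {"id": "T3", "age": 41,   "date": "2024-01-06", "amount": -15.0},  # invalid
]

# 1. Fill missing ages with the median of the known ages.
known = [r["age"] for r in rows if r["age"] is not None]
for r in rows:
    if r["age"] is None:
        r["age"] = median(known)

# 2. Drop duplicate transaction IDs, keeping the first occurrence.
seen, deduped = set(), []
for r in rows:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# 3. Standardize dates to ISO format by trying known patterns.
def to_iso(s):
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {s}")

for r in deduped:
    r["date"] = to_iso(r["date"])

# 4. Remove records with negative amounts.
clean = [r for r in deduped if r["amount"] >= 0]
print([r["id"] for r in clean])  # ['T1', 'T2']
```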
Some data cleaning approaches offer profiling features that automatically highlight common issues, like columns with too many missing values or inconsistent formats. It’s like spell-check, but for our data.
Challenges in cleaning
Cleaning data may seem like a simple fix-up job, but it often involves tricky decisions and hidden problems that can affect downstream analysis.
Tracing the source of errors: Missing or unusual values may stem from user mistakes, system bugs, or data corruption. Without context, it’s hard to choose the right fix.
Deciding what to remove or fill: Dropping rows can lead to data loss, while filling values (e.g., with averages) can introduce bias if not done thoughtfully.
Standardizing inconsistent formats: Fields like dates or country names may appear in multiple forms (e.g., “USA,” “U.S.A.,” “United States”) and must be aligned carefully.
Handling rare or complex cases: Rules that work for most data may fail on edge cases, like names with special characters, leading to errors downstream.
3. Transform: Structuring for insight
Once the data is clean, we move on to shaping it. This means transforming the dataset so it’s aligned with the goals of our analysis. This step is about changing the structure or content of the data to make it more useful, insightful, or compatible with analytical tools.
Transforming data may involve:
Changing data types (e.g., text to dates or numbers).
Normalizing or scaling numeric columns.
Creating new columns (like profit = revenue - cost).
Aggregating data (e.g., monthly sales totals).
Reshaping data (e.g., pivoting tables).
Note: A pivot table is a data summarization tool used to analyze large datasets by reorganizing and summarizing data into a more manageable format.
The goal is to organize and enhance the dataset so it supports meaningful exploration and visualizations.
Example
Let’s say we’re analyzing weekly sales trends. After cleaning the raw transaction data, we might begin by converting the order_date column into a proper date format. From there, we extract the week number or month to help us organize the data over time. Next, we group sales by product_id and week to calculate total weekly sales. Finally, we normalize the sales_amount to a common 0–1 scale so that products with very different sales volumes can be compared fairly.
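A rough Python sketch of this kind of transformation, using invented sample data: derive the ISO week number, aggregate per product per week, then min-max scale the totals.

```python
from datetime import date
from collections import defaultdict

# Hypothetical cleaned transactions: one row per sale.
sales = [
    {"product_id": "A", "order_date": date(2024, 1, 1),  "sales_amount": 100.0},
    {"product_id": "A", "order_date": date(2024, 1, 3),  "sales_amount": 50.0},
    {"product_id": "B", "order_date": date(2024, 1, 10), "sales_amount": 200.0},
]

# Derive the ISO week number, then aggregate sales per product per week.
weekly = defaultdict(float)
for s in sales:
    week = s["order_date"].isocalendar()[1]
    weekly[(s["product_id"], week)] += s["sales_amount"]

# Normalize the weekly totals to a 0-1 range (min-max scaling).
lo, hi = min(weekly.values()), max(weekly.values())
normalized = {k: (v - lo) / (hi - lo) for k, v in weekly.items()}
print(dict(weekly))  # {('A', 1): 150.0, ('B', 2): 200.0}
```

Note the ordering: aggregation happens on cleaned data, and scaling happens last, which matters for the pitfalls listed below.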
Challenges in transformation
Transformation is powerful, but it can backfire if done carelessly. Small mistakes here can quietly distort the results or make the data harder to work with.
Order matters: If we aggregate before cleaning, we might include duplicate or invalid entries in totals.
Scaling problems: Applying transformations like scaling without checking for zero or negative values can break the analysis.
High dimensionality: Creating too many features or columns can make analysis complex or slow down processing.
Interpretability: Over-engineering features can make the analysis harder to explain or trust.
4. Enrich: Expanding context
Once the data is clean and structured, we can enrich it by adding external data or derived insights that give our analysis more depth and context. This step helps us move from isolated numbers to real-world meaning.
Data enrichment can include:
Merging in third-party data (e.g., weather, demographics, pricing benchmarks).
Mapping locations to regions or postal codes.
Adding calculated fields (e.g., customer lifetime value).
Joining internal datasets (e.g., product information with sales records).
The goal is to enhance the dataset with new layers of understanding that reveal patterns we couldn’t see before.
Many companies bring in external datasets (like foot traffic, market trends, or competitor prices) to enrich their internal data. When done right, this can give them a powerful competitive edge.
Example
Suppose we have a dataset of retail transactions. To enrich it, we could join it with weather data to see if rainy days affect foot traffic. We might also map store locations to broader regions for regional sales analysis. Adding public holiday data could help us flag special sale periods, while bringing in average income by postal code could uncover patterns in local spending behavior. Now our data doesn’t just tell us what happened; it starts to explain why.
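Structurally, enrichment is often just a join on a shared key. A minimal Python sketch with hypothetical store and region data — note how an unmatched key is handled explicitly rather than silently dropped:

```python
# Hypothetical internal transactions and an external region lookup.
transactions = [
    {"store_id": "S1", "amount": 120.0},
    {"store_id": "S2", "amount": 75.0},
    {"store_id": "S3", "amount": 60.0},
]
store_regions = {"S1": "North", "S2": "South"}  # S3 has no match

# Left join: keep every transaction, attach a region where a key matches.
enriched = [
    {**t, "region": store_regions.get(t["store_id"], "Unknown")}
    for t in transactions
]
print([t["region"] for t in enriched])  # ['North', 'South', 'Unknown']
```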
Challenges in enrichment
Enrichment can unlock deeper insights, but blending data from multiple sources introduces complexity and risks that must be managed carefully.
Inconsistent keys: Datasets may not share common identifiers (e.g., one uses full postal codes, another uses truncated ones).
Data freshness: External data may be outdated or on a different update cycle.
Compliance risks: Adding personal or sensitive data, like health or financial information, can raise legal and privacy concerns.
Error propagation: If the data we’re enriching is flawed, we may introduce new inaccuracies into our clean dataset.
5. Validate: Trust but verify
After ingesting, cleaning, transforming, and enriching the data, we’re almost ready to analyze, but first, we need to validate it. Validation ensures the data is accurate, consistent, and ready for decision-making. Think of it as a final quality check before insights go live.
Validation involves:
Checking totals and summaries (e.g., total sales match reports).
Verifying that values fall within logical ranges (e.g., no negative ages).
Ensuring formats and types are correct.
Spotting anomalies or inconsistencies that slipped through earlier steps.
This step helps us catch silent errors, prevent misleading results, and build trust in our findings.
Example
Suppose we’re working with employee HR data. Before jumping into analysis, we might first confirm that each hire_date comes before the corresponding termination_date. We’d also check for outliers, like salaries of $0 or anything over $1 million, that could indicate errors. To ensure consistency, we might cross-verify total headcounts with another system or internal report. Finally, we’d make sure that no departments are missing from the dataset, so all areas of the organization are accurately represented. Even well-cleaned data can have logical inconsistencies, and validation helps surface them.
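Such checks are easy to express as small rule functions. A Python sketch with invented HR records; the $0–$1 million salary range mirrors the example above and is not a universal rule:

```python
from datetime import date

# Hypothetical HR records; employee 2 has two deliberate problems.
employees = [
    {"id": 1, "hire_date": date(2020, 3, 1), "termination_date": None, "salary": 55000},
    {"id": 2, "hire_date": date(2022, 6, 1), "termination_date": date(2021, 1, 1), "salary": 0},
]

def validate(emp):
    """Return a list of validation issues (empty means the record passed)."""
    issues = []
    if emp["termination_date"] and emp["termination_date"] < emp["hire_date"]:
        issues.append("termination before hire")
    if not (0 < emp["salary"] <= 1_000_000):
        issues.append("salary out of range")
    return issues

report = {e["id"]: validate(e) for e in employees}
print(report)  # {1: [], 2: ['termination before hire', 'salary out of range']}
```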
Some teams use automated checks to validate datasets. It’s like writing test cases for data.
Challenges in validation
Validation may be the final step, but it’s where silent errors, missed by every other process, are revealed. Overlooking it can undermine the integrity of the entire analysis.
Ambiguity in business rules: Sometimes, there is no clear threshold or “golden rule” for what constitutes valid data. What counts as a reasonable employee attrition rate, for example, may differ by industry and context.
Silent failures: A dataset may pass basic checks (e.g., all fields are filled) yet still contain logical flaws (e.g., a delivery date before the order date).
Automating robust validation: Writing generalizable and reusable validation rules that work across different datasets and use cases is hard. It often requires thorough logic.
But here’s the catch: validation often exposes just how disorganized our data is, revealing issues like outdated records and hidden inconsistencies. Before we can clean or validate effectively, we need to face a core truth: most real-world data starts out messy. From human errors to system glitches, it rarely arrives in perfect shape.
In the next section, we’ll explore the common sources of dirty data and how to fix them, so we can clean with purpose and avoid costly mistakes.
What makes data dirty?
Dirty data isn’t just an abstract concept; it refers to information that fails to accurately or consistently reflect reality. Understanding the sources of dirty data is the first step toward addressing them.
| Source of Dirty Data | What It Means | Common Examples of Issues |
|---|---|---|
| Human entry errors | Mistakes during manual data input. | Typos (“Californa”), transposed numbers (12,000 vs. 21,000), inconsistent labels (“NY” vs. “New York”). |
| System and integration issues | Problems when data flows between different systems or applications. | Date format mismatches, broken characters (“café” appearing as “cafÃ©”), repurposed fields storing the wrong data. |
| Automated collection noise | Errors from non-human data gathering (sensors, web tools). | Faulty sensor spikes, optical recognition misreads (“8” for “B”), incomplete web responses. |
| Metadata and context gaps | Missing information about what data values represent. | Undefined units (“100” without knowing if it’s temperature or revenue), meaningless code labels (“1” vs. “0” status). |
Dimensions of data quality issues
We’ve explored the common ways data assumes a disorganized state, whether due to human error, system mismatches, automation glitches, or missing context. Recognizing these sources is important, but to clean data effectively, we need a structured way to diagnose specific problems.
That’s where data quality dimensions come in.
These dimensions help us evaluate data systematically, making it easier to identify issues and apply the right cleaning techniques. Let’s walk through the key ones we’ll encounter:
| Dimension | Key Question | Examples |
|---|---|---|
| Completeness | Is all required data present? | Empty fields, missing rows after merges |
| Accuracy | Does the data reflect real-world values? | Impossible ages (e.g., -4), logical conflicts, misspellings |
| Noise and outliers | Are there random errors or extreme values? | Sensor spikes, garbled text, extreme highs/lows |
| Consistency | Is data uniformly formatted and labeled? | Date format variations, mixed units, inconsistent spellings |
| Uniqueness | Does each record represent a unique entity? | Duplicate rows, slightly varied records for the same person |
| Timeliness | Is the data current and relevant? | Outdated customer info, delayed reports |
| Validity and integrity | Does the data meet expected formats and rules? | Text in number fields, negative quantities, invalid characters |
Understanding these dimensions allows us to diagnose the specific problems affecting our data more accurately. By clearly identifying the type of issue, whether it’s missing values, duplicates, or inconsistencies, we can choose the most effective cleaning techniques.
Core data-cleaning techniques
Once we’ve identified the problems in our data, the next step is knowing how to fix them. Data cleaning isn’t a one-size-fits-all task; it’s a toolkit of techniques we choose based on the type of issue we’re facing.
1. Parsing and standardization
This technique involves breaking down complex or combined data fields into smaller parts and converting different formats into a consistent style. It ensures that data from diverse sources can be compared and analyzed reliably, and it addresses both consistency and validity issues.
Example: Suppose we have a list of phone numbers from different countries and formats: +1(555)123-4567, 555-123-4567, 001-555-123-4567, (555)1234567. Without standardizing, these look different and can’t be easily searched or grouped. Parsing breaks each number into parts (country code, area code, local number), and standardization converts all numbers into the same format, like +1-555-123-4567. This makes it easy to compare, sort, and analyze phone numbers consistently.
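A minimal Python sketch of this idea, assuming US-style numbers: strip each value down to digits (parsing), then re-emit every number in one canonical format (standardization).

```python
import re

# Hypothetical raw phone numbers in mixed formats.
raw_numbers = ["+1(555)123-4567", "555-123-4567", "001-555-123-4567", "(555)1234567"]

def standardize(number):
    """Parse a US number into its digits and emit +1-AAA-LLL-LLLL."""
    digits = re.sub(r"\D", "", number)   # strip everything but digits
    digits = digits.lstrip("0")          # drop an international-dial prefix (00)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # drop the country code
    if len(digits) != 10:
        raise ValueError(f"cannot parse: {number}")
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print({standardize(n) for n in raw_numbers})  # {'+1-555-123-4567'}
```

All four variants collapse to one canonical value, which is exactly what makes them searchable and groupable afterward.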
2. Deduplication
Deduplication is the process of finding and merging records that refer to the same entity, but appear multiple times with slight differences. It helps reduce redundancy, prevents inflated counts, and ensures the uniqueness of the data.
Example: Suppose we have a customer list where the same person appears multiple times but with slightly different spellings or details: “Jon Smith, jon.smith@email.com,” “John Smith, john.smith@email.com,” “J. Smith,” “jon.smith@email.com.” These records look different but actually refer to the same customer. Deduplication uses techniques like fuzzy matching, which scores how similar two records are, to recognize them as one customer and merge them into a single record.
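One simple way to approximate fuzzy matching is Python’s difflib, which scores string similarity between 0 and 1. The 0.8 threshold below is an arbitrary illustration, not a recommended value; real deduplication usually compares multiple fields, not just names.

```python
from difflib import SequenceMatcher

# Hypothetical customer records with near-duplicate names.
records = ["Jon Smith", "John Smith", "J. Smith", "Alice Brown"]

def similar(a, b, threshold=0.8):
    """True if two strings are similar enough to be treated as duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy dedup: keep a record only if it doesn't fuzzily match one we kept.
unique = []
for rec in records:
    if not any(similar(rec, kept) for kept in unique):
        unique.append(rec)
print(unique)  # ['Jon Smith', 'Alice Brown']
```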
3. Imputation
Imputation fills in missing data values using statistical measures like mean or median, or predictive techniques based on other available data. This maintains dataset completeness while minimizing bias introduced by missing values.
Example: In an e-learning platform’s dataset, if some student records are missing quiz scores, we might fill those missing values with the average quiz score of students with similar progress or activity levels. This way, the dataset stays complete without dropping those records entirely.
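A mean-based imputation can be sketched in a few lines of Python, with invented scores; a real pipeline might instead use the median, or the mean of a similar subgroup, as the example above describes.

```python
from statistics import mean

# Hypothetical quiz scores; None marks a missing value.
scores = [80.0, None, 90.0, 70.0, None]

# Compute the fill value from the known scores only.
known = [s for s in scores if s is not None]
fill = mean(known)  # 80.0

imputed = [s if s is not None else fill for s in scores]
print(imputed)  # [80.0, 80.0, 90.0, 70.0, 80.0]
```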
4. Outlier detection and treatment
Outlier detection identifies data points that fall far outside normal ranges using methods like the interquartile range (IQR) or z-scores. Treatment then decides what to do with them: remove, cap, or correct the values so they don’t distort the analysis.
Example: Temperature sensors might report spikes at 1000°C due to hardware glitches. Detecting these outliers through statistical methods like the interquartile range (IQR) helps us decide whether to exclude or adjust these erroneous values.
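The common IQR rule flags any value more than 1.5 × IQR beyond the quartiles. A Python sketch using the standard library’s statistics.quantiles, with invented sensor readings:

```python
from statistics import quantiles

# Hypothetical sensor temperatures with one obvious glitch.
temps = [21.0, 22.5, 20.8, 23.1, 1000.0, 22.0, 21.7, 20.5]

# Compute the quartiles, then the IQR fences.
q1, _, q3 = quantiles(temps, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [t for t in temps if t < lower or t > upper]
print(outliers)  # [1000.0]
```

Detection is only half the job; whether to drop, cap, or investigate a flagged value is a judgment call that depends on the data’s source.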
5. Validation rules and constraints
Validation rules and constraints enforce predefined business logic and data formats to ensure that only correct and sensible data enters the system. This process helps catch errors early, maintaining the reliability and integrity of the dataset.
Example: In an e-commerce dataset, validation rules can block orders with negative quantities or discounts exceeding 100%. By enforcing these constraints at the data entry or system level, we prevent invalid transactions from distorting sales analytics and financial reports.
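Such constraints can be sketched as a simple rule function in Python, with hypothetical orders; production systems typically enforce the same rules at the database or form level.

```python
# Hypothetical orders; two deliberately break the business rules.
orders = [
    {"id": 1, "quantity": 3,  "discount": 0.10},
    {"id": 2, "quantity": -2, "discount": 0.05},  # negative quantity
    {"id": 3, "quantity": 1,  "discount": 1.50},  # discount over 100%
]

def is_valid(order):
    """Enforce the business rules: positive quantity, discount in [0, 1]."""
    return order["quantity"] > 0 and 0 <= order["discount"] <= 1

accepted = [o for o in orders if is_valid(o)]
rejected = [o["id"] for o in orders if not is_valid(o)]
print(rejected)  # [2, 3]
```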
6. Text-specific cleanup
When working with open-ended fields like customer feedback, product descriptions, or survey comments, text cleanup is essential. This process removes noise (like HTML tags or excessive punctuation), standardizes formatting, and prepares the data for meaningful analysis. It helps ensure that insights drawn from the text are consistent and accurate.
Example: Customer reviews like “Great product!!!” and “great product.” may look similar to us, but raw text analysis would treat them differently. By removing punctuation and shifting to lowercase, we turn them into a consistent format, making it easier to count keywords, group sentiments, or compare feedback across sources.
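A minimal Python cleanup pass — strip HTML tags, drop punctuation, lowercase, and collapse whitespace — applied to invented review text:

```python
import re
import string

# Hypothetical raw reviews with noise: punctuation, spacing, HTML tags.
reviews = ["Great product!!!", "great   product.", "<b>Great</b> product"]

def clean_text(text):
    text = re.sub(r"<[^>]+>", "", text)                            # strip HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return " ".join(text.lower().split())                          # lowercase, collapse spaces

print({clean_text(r) for r in reviews})  # {'great product'}
```

After cleanup, all three variants count as the same phrase, so keyword counts and sentiment groupings line up.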
7. Automation and monitoring
Automation scripts apply routine cleaning tasks such as deduplication, missing value checks, and outlier detection to large or frequently updated datasets. Monitoring tools generate alerts when issues arise, allowing teams to address problems promptly without manual effort.
Example: A bank automates daily transaction data cleaning and sets alerts for duplicate entries or missing files. When an alert is triggered, the data team investigates quickly, ensuring financial reports remain accurate and timely.
Below, we map data cleaning techniques to provide a clear view of which methods address each data-quality dimension.
Data-quality issues and techniques
| Data-Quality Dimension | Core Cleaning Techniques |
|---|---|
| Completeness | Imputation |
| Accuracy | Validation rules, outlier detection and treatment |
| Consistency | Parsing and standardization |
| Uniqueness | Deduplication |
| Timeliness | Automation and monitoring |
| Validity and integrity | Validation rules and constraints |
| Noise and outliers | Outlier detection and treatment, text-specific cleanup |
Wrap up
In conclusion, data wrangling is a vital step in turning raw, disorganized data into a reliable foundation for analysis. By understanding common data issues and applying effective cleaning techniques, we can greatly improve data quality and trust. While it can be time-consuming, this effort ensures that insights and decisions are based on accurate, consistent information. This leads to clearer reports and more confident decision-making.
Up next, we’ll explore key Google Sheets functions to clean, validate, and shape your data. Ready to turn disorganized data into actionable insights? Let’s begin.
Technical Quiz
Which stage of the data wrangling workflow involves bringing data from various disparate locations into a unified working environment?
Cleaning
Transformation
Validation
Ingestion