
Navigate Data Wrangling

Explore the key stages of data wrangling: collecting, cleaning, transforming, enriching, and validating data. Understand how to handle common data issues and apply techniques like deduplication, imputation, and outlier detection in Google Sheets. This lesson helps you turn raw, disorganized data into reliable, structured datasets ready for insightful analysis.

As data analysts, we might wish for clean, consistent, and ready-to-explore datasets. But more often than not, we face a very different reality. Columns may be misnamed or missing, values duplicated or out of range, formats inconsistent across files, and data scattered across systems, APIs, or spreadsheets.

Before we can explore trends, create visualizations, or run any analysis, we need to make sense of the chaos in our data. This is where data wrangling comes in. It’s the essential process of turning raw, messy data into clean, structured, and usable information.

What is data wrangling?

Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making. It involves taking messy, unorganized, or incomplete datasets and transforming them into something usable. Whether we are removing duplicate entries, fixing incorrect values, merging datasets, or creating new calculated fields, all of it falls under the broad umbrella of data wrangling.

The goal of data wrangling isn’t just about cleaning up data for the sake of neatness. It’s about preparing data so that we can extract reliable insights, build meaningful charts, and support sound decision-making.

Data wrangling vs data cleaning: What’s the difference?

One common point of confusion is data wrangling vs data cleaning. While they’re closely related, they’re not the same.

Data cleaning

  • Focuses on fixing errors

  • Removes duplicates

  • Handles missing or invalid values

  • Standardizes formats

Data wrangling

  • Includes data cleaning

  • Plus transformation, enrichment, and validation

  • Prepares data end-to-end for analysis or modeling

Think of it this way: Cleaning is a step. Wrangling is the entire journey.

Why data wrangling matters in the data analysis pipeline

We can think of the data analysis pipeline as a relay race. If one stage hands off bad data to the next, the whole process falters. Without thorough data wrangling:

  • Our analysis might be based on incorrect assumptions.

  • Visualizations can become misleading or outright wrong.

The quality of our insights directly depends on the quality of our data. In fact, studies (https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/) suggest that up to 80% of a data analyst’s time is spent wrangling data before any analysis even begins. Poor data quality doesn’t just waste time; it can have costly consequences.

In 1999, the NASA Mars Climate Orbiter (https://everydayastronaut.com/mars-climate-orbiter/) crashed because of a failure to convert units between metric and imperial systems. The mission was lost when the spacecraft entered the Martian atmosphere at the wrong angle, costing NASA approximately $327 million.

This wasn’t a coding error; it was a data standardization failure. Even in day-to-day analysis, this example reminds us why careful cleaning, unit alignment, and validation are critical parts of data wrangling. They don’t just clean up a dataset; they protect the decisions built on it.

Where does data wrangling fit with ETL pipelines?

In production systems, data wrangling often happens inside ETL pipelines.

ETL stands for:

  • Extract data from sources

  • Transform it into usable formats

  • Load it into a database or warehouse

ETL pipelines automate data wrangling at scale, especially in business intelligence and analytics systems.
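The three ETL stages can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the table name, fields, and sample rows are all invented, and real systems use dedicated ETL tooling.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read raw rows from a CSV source (here, an in-memory string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: coerce types and standardize a messy text field."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:name, :amount)", rows)

raw = "name,amount\n alice ,10.5\nBOB,3\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT name, amount FROM orders").fetchall())
# [('Alice', 10.5), ('Bob', 3.0)]
```

The point is the shape of the flow: each stage hands a cleaner version of the data to the next, which is exactly the relay-race idea from earlier.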

Data wrangling workflow

Think of the data wrangling workflow as a reliable guide for preparing any dataset, from disorganized data to analysis-ready. Let’s break it down into a clear, step-by-step process:

Data wrangling’s key processes

Each step builds on the last, and skipping any one can cause problems later on. Let’s explore what happens at each stage and why it matters.

1. Ingest: Collecting the raw materials

Before we can clean or analyze anything, we need to gather raw data, and that’s what ingestion is all about. This step brings data together from different sources into one place, where we can begin working with it.

Note: We’ve already explored data collection in the previous chapter, so some of this may feel familiar. Here, we’re taking it a step further: bringing that collected data into our analysis environment so we can start preparing it.

Data ingestion can include:

  • Gathering records from structured data sources.

  • Loading files like CSVs or Excel sheets from local or shared storage.

  • Accessing data directly from web sources.

  • Importing semi-structured formats like XML.

  • Bringing data from various cloud storage locations.

The goal is to gather all relevant data, no matter the format or location, into our working environment for analysis.

Example

Let’s say we’re analyzing customer purchases. We might start by gathering order data from a central repository. Next, we download product details from a CSV file maintained by the inventory team. To get the latest prices, we access live pricing data from an online source. Once we’ve gathered all three sources, we combine them into a single coherent dataset. Only after this ingestion step can we begin to clean, merge, and make sense of the data.
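As a rough Python sketch of that ingestion step, with in-memory stand-ins for the three sources (all names and values here are invented for illustration; in practice these would be database queries, file loads, and API calls):

```python
import csv
import io

# Stand-ins for three sources: an orders export, the inventory team's
# product CSV, and a live price feed (modeled here as a plain dict).
orders_csv = "order_id,product_id,qty\n1,P1,2\n2,P2,1\n"
products_csv = "product_id,name\nP1,Widget\nP2,Gadget\n"
live_prices = {"P1": 4.99, "P2": 12.50}

orders = list(csv.DictReader(io.StringIO(orders_csv)))
products = {p["product_id"]: p["name"]
            for p in csv.DictReader(io.StringIO(products_csv))}

# Combine the three sources into one coherent dataset.
dataset = [
    {
        "order_id": o["order_id"],
        "product": products[o["product_id"]],
        "qty": int(o["qty"]),
        "unit_price": live_prices[o["product_id"]],
    }
    for o in orders
]
print(dataset[0])
```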

Some companies set up automated data ingestion pipelines that pull, clean, and store fresh data every few minutes, like a self-cleaning fridge that restocks itself with clean groceries.

Challenges during ingestion

While ingestion may sound straightforward, it often comes with technical and logistical hurdles that we need to plan for.

  • Different formats: Data might come from structured sources, CSV files, or web-based data feeds, each needing its own handling and alignment.

  • Access issues: Some sources require credentials, logins, or specific permissions to connect and fetch data.

  • Big data loads: Massive datasets can overwhelm memory or crash tools. Processing data in smaller chunks can help manage it safely.

  • Real-time delays: Live data (like from sensors or web feeds) may be incomplete, late, or out of order, requiring careful handling and timestamp syncing.

2. Clean: Making the data usable

Cleaning is the step where we make the raw, ingested data trustworthy. This means correcting errors, removing duplicates, handling missing values, and fixing formatting issues. Without clean data, even the best analysis can give misleading or incorrect results.

In this step, we focus on:

  • Filling in missing values (e.g., blanks, placeholders).

  • Removing or flagging duplicates.

  • Standardizing formats (e.g., dates, capitalization, currency).

  • Fixing invalid or out-of-range values.

The goal is to make sure the data is consistent, complete, and accurate, so it won’t introduce bias or confusion later in the analysis workflow.

Example

Imagine we’re working with a customer transaction dataset. During the cleaning phase, we might start by filling missing values in the CustomerAge column using the median age. Next, we drop duplicate rows where the same TransactionID appears more than once. We then convert all purchase_date entries into a standard date format to ensure consistency. Finally, we remove any records where the total amount is negative, since these likely indicate data entry errors. By the end of this step, our dataset is no longer messy; it’s structured, reliable, and ready for transformation.
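The four cleaning steps above can be sketched in Python (the sample rows and column names are invented; in Google Sheets the same fixes would be done with formulas, filters, and the remove-duplicates tool):

```python
from datetime import datetime
from statistics import median

transactions = [
    {"TransactionID": "T1", "CustomerAge": 34, "purchase_date": "2024-01-05", "amount": 20.0},
    {"TransactionID": "T1", "CustomerAge": 34, "purchase_date": "2024-01-05", "amount": 20.0},  # duplicate
    {"TransactionID": "T2", "CustomerAge": None, "purchase_date": "05/01/2024", "amount": 15.0},
    {"TransactionID": "T3", "CustomerAge": 52, "purchase_date": "2024-01-06", "amount": -5.0},  # entry error
]

# 1. Drop duplicate TransactionIDs, keeping the first occurrence.
seen, deduped = set(), []
for t in transactions:
    if t["TransactionID"] not in seen:
        seen.add(t["TransactionID"])
        deduped.append(dict(t))

# 2. Fill missing ages with the median of the known ages.
ages = [t["CustomerAge"] for t in deduped if t["CustomerAge"] is not None]
for t in deduped:
    if t["CustomerAge"] is None:
        t["CustomerAge"] = median(ages)

# 3. Standardize dates to ISO format, trying each known input format.
def to_iso(value):
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {value}")

for t in deduped:
    t["purchase_date"] = to_iso(t["purchase_date"])

# 4. Remove rows with negative amounts (likely entry errors).
clean = [t for t in deduped if t["amount"] >= 0]
print(clean)
```

Note that the order of operations matters: deduplicating first keeps the duplicate row from skewing the median used in step 2.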

Some data cleaning approaches offer profiling features that automatically highlight common issues, like columns with too many missing values or inconsistent formats. It’s like spell-check, but for our data.

Challenges in cleaning

Cleaning data may seem like a simple fix-up job, but it often involves tricky decisions and hidden problems that can affect downstream analysis.

  • Tracing the source of errors: Missing or unusual values may stem from user mistakes, system bugs, or data corruption. Without context, it’s hard to choose the right fix.

  • Deciding what to remove or fill: Dropping rows can lead to data loss, while filling values (e.g., with averages) can introduce bias if not done thoughtfully.

  • Standardizing inconsistent formats: Fields like dates or country names may appear in multiple forms (e.g., “USA,” “U.S.A.,” “United States”) and must be aligned carefully.

  • Handling rare or complex cases: Rules that work for most data may fail on edge cases, like names with special characters, leading to errors downstream.

3. Transform: Structuring for insight

Once the data is clean, we move on to shaping it. This means transforming the dataset so it’s aligned with the goals of our analysis. This step is about changing the structure or content of the data to make it more useful, insightful, or compatible with analytical tools.

Transforming data may involve:

  • Changing data types (e.g., text to dates or numbers).

  • Normalizing or scaling numeric columns.

  • Creating new columns (like profit = revenue - cost).

  • Aggregating data (e.g., monthly sales totals).

  • Reshaping data (e.g., pivot tables, which summarize large datasets by reorganizing them into a more manageable format).

The goal is to organize and enhance the dataset so it supports meaningful exploration and visualizations.

Example

Let’s say we’re analyzing weekly sales trends. After cleaning the raw transaction data, we might begin by converting the order_date column into a proper date format. From there, we extract the week number or month to help us organize the data over time. Next, we group sales by product_id and week to calculate total weekly sales. Finally, we normalize the sales_amount to a 0–1 scale to make it easier to compare across products and time periods during analysis. These transformations help us go from raw rows to the patterns and comparisons that matter.
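A minimal Python sketch of that aggregation and normalization (sample rows invented; the 0–1 scaling shown is min–max scaling, and it assumes the totals are not all identical, which would make the divisor zero):

```python
from collections import defaultdict
from datetime import date

sales = [
    {"product_id": "P1", "order_date": date(2024, 1, 2), "sales_amount": 100.0},
    {"product_id": "P1", "order_date": date(2024, 1, 3), "sales_amount": 50.0},
    {"product_id": "P1", "order_date": date(2024, 1, 10), "sales_amount": 300.0},
]

# Aggregate: total sales per (product, ISO week number).
weekly = defaultdict(float)
for row in sales:
    week = row["order_date"].isocalendar()[1]
    weekly[(row["product_id"], week)] += row["sales_amount"]

# Normalize totals to a 0-1 scale (min-max scaling).
lo, hi = min(weekly.values()), max(weekly.values())
normalized = {k: (v - lo) / (hi - lo) for k, v in weekly.items()}
print(weekly, normalized)
```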

Challenges in transformation

Transformation is powerful, but it can backfire if done carelessly. Small mistakes here can quietly distort the results or make the data harder to work with.

  • Order matters: If we aggregate before cleaning, we might include duplicate or invalid entries in totals.

  • Scaling problems: Applying transformations like scaling without checking for zero or negative values can break the analysis.

  • High dimensionality: Creating too many features or columns can make analysis complex or slow down processing.

  • Interpretability: Over-engineering features can make the analysis harder to explain or trust.

4. Enrich: Expanding context

Once the data is clean and structured, we can enrich it by adding external data or derived insights that give our analysis more depth and context. This step helps us move from isolated numbers to real-world meaning.

Data enrichment can include:

  • Merging in third-party data (e.g., weather, demographics, pricing benchmarks).

  • Mapping locations to regions or postal codes.

  • Adding calculated fields (e.g., customer lifetime value).

  • Joining internal datasets (e.g., product information with sales records).

The goal is to enhance the dataset with new layers of understanding that reveal patterns we couldn’t see before.

Many companies bring in external datasets (like foot traffic, market trends, or competitor prices) to enrich their internal data. When done right, this can give them a powerful competitive edge.

Example

Suppose we have a dataset of retail transactions. To enrich it, we could join it with weather data to see if rainy days affect foot traffic. We might also map store locations to broader regions for regional sales analysis. Adding public holiday data could help us flag special sale periods, while bringing in average income by postal code could uncover patterns in local spending behavior. Now our data doesn’t just tell us what happened; it starts to explain why.
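A simple enrichment join can be sketched like this, with made-up weather values and a hypothetical store-to-region mapping standing in for real external sources:

```python
transactions = [
    {"date": "2024-03-01", "store": "S1", "revenue": 900},
    {"date": "2024-03-02", "store": "S1", "revenue": 450},
]
# Hypothetical external weather table, keyed by date.
weather = {"2024-03-01": "sunny", "2024-03-02": "rainy"}
# Internal lookup mapping stores to regions.
regions = {"S1": "North"}

# Join both lookups onto each transaction, with a safe default
# when a date has no weather record.
enriched = [
    {**t,
     "weather": weather.get(t["date"], "unknown"),
     "region": regions.get(t["store"])}
    for t in transactions
]
print(enriched[1])
```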

Challenges in enrichment

Enrichment can unlock deeper insights, but blending data from multiple sources introduces complexity and risks that must be managed carefully.

  • Inconsistent keys: Datasets may not share common identifiers (e.g., one uses full postal codes, another uses truncated ones).

  • Data freshness: External data may be outdated or on a different update cycle.

  • Compliance risks: Adding personal or sensitive data, like health or financial information, can raise legal and privacy concerns.

  • Error propagation: If the data we’re enriching is flawed, we may introduce new inaccuracies into our clean dataset.

5. Validate: Trust but verify

After ingesting, cleaning, transforming, and enriching the data, we’re almost ready to analyze, but first, we need to validate it. Validation ensures the data is accurate, consistent, and ready for decision-making. Think of it as a final quality check before insights go live.

Validation involves:

  • Checking totals and summaries (e.g., total sales match reports).

  • Verifying that values fall within logical ranges (e.g., no negative ages).

  • Ensuring formats and types are correct.

  • Spotting anomalies or inconsistencies that slipped through earlier steps.

This step helps us catch silent errors, prevent misleading results, and build trust in our findings.

Example

Suppose we’re working with employee HR data. Before jumping into analysis, we might first confirm that each hire_date comes before the corresponding termination_date. We’d also check for outliers, like salaries of $0 or anything over $1 million, that could indicate errors. To ensure consistency, we might cross-verify total headcounts with another system or internal report. Finally, we’d make sure that no departments are missing from the dataset, so all areas of the organization are accurately represented. Even well-cleaned data can have logical inconsistencies, and validation helps surface them.
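Two of those checks, sketched as a small Python validator (field names and the $0–$1M salary range are illustrative assumptions, not universal rules):

```python
from datetime import date

employees = [
    {"id": 1, "hire_date": date(2020, 1, 15), "termination_date": None,
     "salary": 65000, "dept": "Sales"},
    {"id": 2, "hire_date": date(2022, 6, 1), "termination_date": date(2021, 5, 30),
     "salary": 0, "dept": "HR"},
]

def validate(emp):
    """Return a list of rule violations for one employee record."""
    issues = []
    if emp["termination_date"] and emp["termination_date"] < emp["hire_date"]:
        issues.append("termination before hire")
    if not (0 < emp["salary"] <= 1_000_000):
        issues.append("salary out of range")
    return issues

report = {e["id"]: validate(e) for e in employees}
print(report)  # employee 2 fails both checks
```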

Some teams use automated checks to validate datasets. It’s like writing test cases for data.

Challenges in validation

Validation may be the final step, but it’s where silent errors, missed by every other process, are revealed. Overlooking it can undermine the integrity of the entire analysis.

  • Ambiguity in business rules: Sometimes, there is no clear threshold or “golden rule” for what constitutes valid data. What counts as a reasonable employee attrition rate, for example, may differ by industry and context.

  • Silent failures: A dataset may pass basic checks (e.g., all fields are filled) yet still contain logical flaws (e.g., a delivery date before the order date).

  • Automating robust validation: Writing generalizable and reusable validation rules that work across different datasets and use cases is hard. It often requires carefully designed logic.

But here’s the catch: validation often exposes just how disorganized our data is, revealing issues like outdated records and hidden inconsistencies. Before we can clean or validate effectively, we need to face a core truth: most data starts out messy. From human errors to system glitches, real-world data rarely arrives in perfect shape.

In the next section, we’ll explore the common sources of dirty data and how to fix them, so we can clean with purpose and avoid costly mistakes.

What makes data dirty?

Dirty data isn’t just an abstract concept; it refers to information that fails to accurately or consistently reflect reality. Understanding the sources of dirty data is the first step toward addressing them.

| Source of Dirty Data | What It Means | Common Examples of Issues |
| --- | --- | --- |
| Human entry errors | Mistakes during manual data input. | Typos (“Californa”), transposed numbers (12,000 vs. 21,000), inconsistent labels (“NY” vs. “New York”). |
| System and integration issues | Problems when data flows between different systems or applications. | Date format mismatches, mis-encoded characters (e.g., “café” appearing as “cafÃ©”), repurposed fields storing the wrong data. |
| Automated collection noise | Errors from non-human data gathering (sensors, web tools). | Faulty sensor spikes, optical recognition misreads (“8” for “B”), incomplete web responses. |
| Metadata and context gaps | Missing information about what data values represent. | Undefined units (“100” without knowing if it’s temperature or revenue), meaningless code labels (“1” vs. “0” status). |

Dimensions of data quality issues

We’ve explored the common ways data becomes dirty, whether due to human error, system mismatches, automation glitches, or missing context. Recognizing these sources is important, but to clean data effectively, we need a structured way to diagnose specific problems.

That’s where data quality dimensions come in.

These dimensions help us evaluate data systematically, making it easier to identify issues and apply the right cleaning techniques. Let’s walk through the key ones we’ll encounter:

| Dimension | Key Question | Examples |
| --- | --- | --- |
| Completeness | Is all required data present? | Empty fields, missing rows after merges |
| Accuracy | Does the data reflect real-world values? | Impossible ages (e.g., -4), logical conflicts, misspellings |
| Noise and outliers | Are there random errors or extreme values? | Sensor spikes, garbled text, extreme highs/lows |
| Consistency | Is data uniformly formatted and labeled? | Date format variations, mixed units, inconsistent spellings |
| Uniqueness | Does each record represent a unique entity? | Duplicate rows, slightly varied records for the same person |
| Timeliness | Is the data current and relevant? | Outdated customer info, delayed reports |
| Validity and integrity | Does the data meet expected formats and rules? | Text in number fields, negative quantities, invalid characters |

Understanding these dimensions allows us to diagnose the specific problems affecting our data more accurately. By clearly identifying the type of issue, whether it’s missing values, duplicates, or inconsistencies, we can choose the most effective cleaning techniques.

Core data-cleaning techniques

Once we’ve identified the problems in our data, the next step is knowing how to fix them. Data cleaning isn’t a one-size-fits-all task; it’s a toolkit of techniques we choose based on the type of issue we’re facing.

1. Parsing and standardization

This technique involves breaking down complex or combined data fields into smaller parts and converting different formats into a consistent style. It ensures that data from diverse sources can be compared and analyzed reliably; it also handles consistency and validity issues.

Example: Suppose we have a list of phone numbers from different countries and formats: +1(555)123-4567, 555-123-4567, 001-555-123-4567, (555)1234567. Without standardizing, these look different and can’t be easily searched or grouped. Parsing breaks each number into parts (country code, area code, local number), and standardization converts all numbers into the same format, like +1-555-123-4567. This makes it easy to compare, sort, and analyze phone numbers consistently.
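One way to parse and standardize those phone numbers, sketched in Python with a regular expression (this assumes US-style 10-digit numbers; the helper name and rules are illustrative, not a general-purpose phone parser):

```python
import re

def standardize_us_phone(raw):
    """Parse the digits out of a messy US phone string and emit +1-AAA-LLL-NNNN."""
    digits = re.sub(r"\D", "", raw)   # parsing: keep digits only
    digits = digits.lstrip("0")       # drop an international "00" dialing prefix
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]           # drop the leading country code
    if len(digits) != 10:
        raise ValueError(f"cannot parse: {raw}")
    # standardization: emit one consistent format
    return f"+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}"

samples = ["+1(555)123-4567", "555-123-4567", "001-555-123-4567", "(555)1234567"]
print([standardize_us_phone(s) for s in samples])
# all four collapse to '+1-555-123-4567'
```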

2. Deduplication

Deduplication is the process of finding and merging records that refer to the same entity, but appear multiple times with slight differences. It helps reduce redundancy, prevents inflated counts, and ensures the uniqueness of the data.

Example: Suppose we have a customer list where the same person appears multiple times but with slightly different spellings or details: “Jon Smith, jon.smith@email.com,” “John Smith, john.smith@email.com,” “J. Smith,” “jon.smith@email.com.” These records look different but actually refer to the same customer. Deduplication uses techniques like fuzzy matching (identifying similar but not identical entries, such as “Jon” vs. “John”; in Google Sheets this can be done with add-ons like Fuzzy Match or with formulas built from functions like INDEX, MATCH, and REGEXMATCH) to identify and merge these duplicate entries into a single, accurate record. This process helps keep customer counts correct and avoids overestimating metrics like sales or engagement.
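Outside of Sheets, a rough fuzzy-matching sketch using Python’s standard library; the 0.8 similarity threshold and the greedy grouping strategy are arbitrary choices for illustration:

```python
from difflib import SequenceMatcher

customers = [
    {"name": "Jon Smith", "email": "jon.smith@email.com"},
    {"name": "John Smith", "email": "john.smith@email.com"},
    {"name": "Alice Wong", "email": "alice@email.com"},
]

def similar(a, b, threshold=0.8):
    """Fuzzy match: character-similarity ratio between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy grouping: each record joins the first group whose
# representative (first member) it matches closely enough.
groups = []
for c in customers:
    for g in groups:
        if similar(c["name"], g[0]["name"]):
            g.append(c)
            break
    else:
        groups.append([c])

print(len(groups))  # the two Smith variants collapse into one group
```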

3. Imputation

Imputation fills in missing data values using statistical measures like mean or median, or predictive techniques based on other available data. This maintains dataset completeness while minimizing bias introduced by missing values.

Example: In an e-learning platform’s dataset, if some student records are missing quiz scores, we might fill those missing values with the average quiz score of students with similar progress or activity levels. This way, the dataset stays complete without dropping those records entirely.
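That group-based fill can be sketched as follows (student names, the progress grouping, and the use of the group mean are all invented for illustration):

```python
from statistics import mean

students = [
    {"name": "A", "progress": "high", "quiz": 92},
    {"name": "B", "progress": "high", "quiz": None},  # missing score
    {"name": "C", "progress": "high", "quiz": 88},
    {"name": "D", "progress": "low", "quiz": 55},
]

# Impute each missing quiz score with the mean of the same progress group.
for group in {s["progress"] for s in students}:
    known = [s["quiz"] for s in students
             if s["progress"] == group and s["quiz"] is not None]
    fill = mean(known) if known else None
    for s in students:
        if s["progress"] == group and s["quiz"] is None:
            s["quiz"] = fill

print(students[1]["quiz"])  # mean of 92 and 88
```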

4. Outlier detection and treatment

Outlier detection identifies data points far from normal ranges using methods like the IQR (interquartile range, the range where the middle half of the data lies), the z-score (how far a value is from the average, in standard units), and boxplots (a simple chart that shows the spread of data and highlights unusual values). These points can then be corrected, removed, or flagged for review. Treatments include removing or capping extreme values to reduce noise. This helps ensure more accurate and reliable analysis results.

Example: Temperature sensors might report spikes at 1000°C due to hardware glitches. Detecting these outliers through statistical methods like the interquartile range (IQR) helps decide whether to exclude or adjust these erroneous values.
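The IQR rule from that example, as a short Python sketch (readings invented; the conventional 1.5 multiplier is used, though other cutoffs are possible):

```python
from statistics import quantiles

readings = [21.5, 22.0, 21.8, 22.3, 21.9, 1000.0, 22.1]  # one sensor spike

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = quantiles(readings, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [r for r in readings if not (low <= r <= high)]
cleaned = [r for r in readings if low <= r <= high]
print(outliers)  # only the 1000.0 spike is flagged
```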

5. Validation rules and constraints

Validation rules and constraints enforce predefined business logic and data formats to ensure that only correct and sensible data enters the system. This process helps catch errors early, maintaining the reliability and integrity of the dataset.

Example: In e-commerce data, validation rules can block orders with negative quantities or discounts exceeding 100%. By enforcing these constraints at the data entry or system level, we prevent invalid transactions from distorting sales analytics and financial reports.
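Those two constraints can be expressed as a tiny rule-checking function (field names are illustrative; in a real system such rules would live in form validation or database constraints):

```python
def validate_order(order):
    """Apply business rules; return a list of violations (empty means valid)."""
    errors = []
    if order["quantity"] <= 0:
        errors.append("quantity must be positive")
    if not (0 <= order["discount_pct"] <= 100):
        errors.append("discount must be between 0 and 100")
    return errors

good = {"quantity": 3, "discount_pct": 15}
bad = {"quantity": -1, "discount_pct": 120}
print(validate_order(good), validate_order(bad))
```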

6. Text-specific cleanup

When working with open-ended fields like customer feedback, product descriptions, or survey comments, text cleanup is essential. This process removes noise (like HTML tags or excessive punctuation), standardizes formatting, and prepares the data for meaningful analysis. It helps ensure that insights drawn from the text are consistent and accurate.

Example: Customer reviews like “Great product!!!” and “great product.” may look similar to us, but raw text analysis would treat them differently. By removing punctuation and shifting to lowercase, we turn them into a consistent format, making it easier to count keywords, group sentiments, or compare feedback across sources.
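A minimal cleanup pipeline for that example, sketched with regular expressions (the sample reviews and the exact cleaning steps are illustrative; real text pipelines often add stemming, stop-word removal, and more):

```python
import re
from collections import Counter

reviews = ["Great product!!!", "great product.", "<b>GREAT</b> product"]

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = re.sub(r"[^\w\s]", "", text)   # strip punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

cleaned = [clean_text(r) for r in reviews]
print(Counter(cleaned))  # all three variants now count as one phrase
```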

7. Automation and monitoring

Automation scripts apply routine cleaning tasks such as deduplication, missing value checks, and outlier detection to large or frequently updated datasets. Monitoring tools generate alerts when issues arise, allowing teams to address problems promptly without manual effort.

Example: A bank automates daily transaction data cleaning and sets alerts for duplicate entries or missing files. When an alert is triggered, the data team investigates quickly, ensuring financial reports remain accurate and timely.

Below, we map data cleaning techniques to provide a clear view of which methods address each data-quality dimension.

Data-quality issues and techniques

| Data-Quality Dimension | Core Cleaning Techniques |
| --- | --- |
| Completeness | Imputation |
| Accuracy | Validation rules, outlier detection and treatment |
| Consistency | Parsing and standardization |
| Uniqueness | Deduplication |
| Timeliness | Automation and monitoring |
| Validity and integrity | Validation rules and constraints |
| Noise and outliers | Outlier detection and treatment, text-specific cleanup |

Wrap up

In conclusion, data wrangling is a vital step in turning raw, disorganized data into a reliable foundation for analysis. By understanding common data issues and applying effective cleaning techniques, we can greatly improve data quality and trust. While it can be time-consuming, this effort ensures that insights and decisions are based on accurate, consistent information. This leads to clearer reports and more confident decision-making.

Up next, we’ll explore key Google Sheets functions to clean, validate, and shape your data. Ready to turn disorganized data into actionable insights? Let’s begin.

Technical Quiz

1. Which stage of the data wrangling workflow involves bringing data from various disparate locations into a unified working environment?

A. Cleaning
B. Transformation
C. Validation
D. Ingestion

