Data Wrangling

Get introduced to data wrangling in Python, its techniques and examples, and how it compares to other related concepts.

Introduction

Data wrangling, also called data cleaning, data munging, or data transformation, is the process of transforming data from its raw format into a meaningful format that can be used for further analysis, such as data visualization, data analysis, and machine learning.

Here are some examples of data wrangling:

  • Finding and removing syntax errors in data

  • Finding and handling missing values

  • Finding and handling outliers

  • Removing irrelevant data

  • Merging or splitting columns
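
A couple of these operations can be sketched in pandas; the sample records below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical records, invented for illustration only
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "temp_notes": ["x", "y"],  # an irrelevant column
})

# Splitting one column into two
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

# Removing irrelevant data
df = df.drop(columns=["temp_notes"])

print(df)
```

Later lessons cover these techniques in more detail; this sketch only shows that a single line of pandas can perform each operation.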

[Figure: The data wrangling process]

Data wrangling vs. other related concepts

Let’s see how data wrangling differs from related concepts such as data cleaning, data mining, data visualization, data analysis, and machine learning.

Comparison of Different Data Concepts

  • Data wrangling: Transforming data from a raw format into a meaningful format for further analysis.

  • Data cleaning: One of the many steps undertaken during data wrangling.

  • Data mining: The general process of finding and extracting meaningful patterns from large datasets using various algorithms and techniques. We perform data wrangling during the data preparation step of data mining.

  • Data visualization: The process of representing data using visual elements, such as charts. Before performing data visualization, we prepare the data through data wrangling.

  • Data analysis: The process of applying statistical or logical techniques to describe, illustrate, and evaluate data. Data wrangling is a prerequisite step in data analysis.

  • Machine learning: The process of building systems that make predictions using insights from data. We prepare data for training machine learning models using data wrangling techniques.

Data wrangling in the context of machine learning

When working on machine learning problems, we usually use data wrangling techniques to prepare data for consumption by machine learning models. If the data in question is appropriate, then the model can make predictions accurately. If not, then the model tends to make mistakes during prediction.

This process of data preparation for model consumption is called feature engineering. During this process, new data can be created from existing data, while irrelevant data that doesn't contribute to model accuracy can be discarded.
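
As a minimal sketch of this idea, here is a hypothetical feature-engineering step in pandas (the trip records and column names below are made up, not taken from this lesson):

```python
import pandas as pd

# Hypothetical trip records, invented for illustration
trips = pd.DataFrame({
    "distance_km": [12.0, 3.5, 8.0],
    "duration_h":  [0.4, 0.1, 0.25],
    "driver_id":   [101, 102, 103],  # an identifier with no predictive value here
})

# Create new data from existing data
trips["avg_speed_kmh"] = trips["distance_km"] / trips["duration_h"]

# Discard data that doesn't contribute to model accuracy
features = trips.drop(columns=["driver_id"])

print(features)
```

The derived `avg_speed_kmh` column is exactly the kind of new feature a model might learn from, while the raw identifier is dropped.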

Furthermore, we can apply advanced data wrangling techniques that involve machine learning models in preparing data for further use. For example, clustering, classification, and regression models can be used for creating new data that classifies records into different groups.
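
For instance, here is a hedged sketch of that advanced technique, using scikit-learn's KMeans clustering on made-up spend values to generate a new grouping column:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer spend values, invented for illustration
df = pd.DataFrame({"monthly_spend": [10.0, 12.0, 9.5, 80.0, 83.0, 78.5]})

# Use a clustering model to create a new column that groups similar records
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["spend_group"] = model.fit_predict(df[["monthly_spend"]])

print(df)
```

The new `spend_group` column classifies records into different groups, which can then feed further analysis.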

Data wrangling technique: Code example

Deleting records with missing values is a data wrangling technique. Let’s look at the following example to see how we can use Python to delete records with missing values.
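
Since the interactive widget isn't reproduced here, the sketch below mirrors its steps; because the lesson's student.csv file isn't bundled, we first write a small, made-up version of it to disk so the example runs end to end:

```python
import pandas as pd

# student.csv isn't bundled with this sketch, so we first write a small,
# hypothetical version of it to disk purely so the example runs standalone.
with open("student.csv", "w") as f:
    f.write("name,age,grade\nAlice,20,A\nBob,,B\nCarol,22,\nDave,21,C\n")

df = pd.read_csv("student.csv")  # read the student.csv dataset into df
df_cleaned = df.dropna()         # delete records with any missing value
print(df_cleaned.head())         # preview the cleaned dataset
```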

  • First, we import the data manipulation library pandas using import pandas as pd.

  • Next, we read the student.csv dataset using the read_csv() function and store it inside df.

  • Then, we delete records with any missing value by applying the dropna() function to the df DataFrame and save the resulting records inside df_cleaned.

  • Finally, we preview the cleaned dataset using the print() function and df_cleaned.head() to observe whether the changes were applied.


Using very few lines of code, we've performed a data wrangling technique that prepares our dataset for further analysis.

In future lessons, we'll take a closer look at the techniques we've applied in this example, such as importing libraries and reading files.

Data wrangling challenges

Sometimes, challenges arise when performing data wrangling. These challenges include:

  • Data accessibility: Accessing appropriate data to answer specific research questions can be problematic. In particular, this can be an issue if the data in question is sensitive and involves personally identifiable information. In many cases, getting approval from the relevant stakeholders can be lengthy and, as a result, delay the project.
  • Avoiding selection bias: If we have lots of data, deciding which data to work with can be a problem if we don’t have sufficient domain knowledge about the population.
  • Reproducible outcomes: Producing final datasets that other data stakeholders answering the same research question can reproduce is difficult if no documentation exists listing and explaining the data wrangling steps that were taken.
  • Data variability: Some projects require sourcing data from multiple data sources, such as spreadsheets, SQL databases, NoSQL databases, and even hard copy documents. Using these data sources and retrieving data stored in a nonstandard format can take time and further increase project costs.

The concept of tidy data

Many scholars have explored data wrangling in depth and have created approaches to effective data wrangling. One such scholar is Hadley Wickham, who wrote a paper titled “Tidy Data.” The paper proposes a framework that makes it easy to tidy up messy datasets and defines raw data, which it refers to as messy data, as unorganized data that exhibits one or more of the following conditions:

  • Column headers are values, not variable names.

  • Multiple variables are stored in one column.

  • Variables are stored in both rows and columns.

  • Multiple types of observational units are stored in the same table.

  • A single observational unit is stored in multiple tables.

Following Wickham's philosophy, the main goal of data wrangling is to tidy up the data and, as a result, produce a dataset that meets the following criteria:

  • Each observation is in a row

  • Each variable is in a column

  • Each value has its own cell
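
For example, a messy table whose column headers are values rather than variable names (years, in this made-up sketch) can be reshaped into tidy form with pandas:

```python
import pandas as pd

# Messy, made-up data: the column headers 2022 and 2023 are values, not variables
messy = pd.DataFrame({
    "country": ["Kenya", "Ghana"],
    "2022": [100, 80],
    "2023": [120, 90],
})

# Tidy it: one observation per row, one variable per column, one value per cell
tidy = messy.melt(id_vars="country", var_name="year", value_name="sales")

print(tidy)
```

After melting, year becomes an ordinary variable column and each country-year observation occupies its own row.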

This course will strive to achieve tidy datasets that meet the criteria above.