Data Wrangling

Get introduced to data wrangling in Python, its techniques and examples, and how it compares to other related concepts.

Introduction

Data wrangling, also called data cleaning, data munging, or data transformation, is the process of transforming data from its raw format into a meaningful format that can be used for further analysis, such as data visualization, data analysis, and machine learning.

Here are some examples of data wrangling:

  • Finding and removing syntax errors in data

  • Finding and handling missing values

  • Finding and handling outliers

  • Removing irrelevant data

  • Merging or splitting columns
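
A couple of these operations can be sketched in pandas; the sample records below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical records, invented for illustration only
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "temp_notes": ["x", "y"],  # an irrelevant column
})

# Splitting one column into two
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

# Removing irrelevant data
df = df.drop(columns=["temp_notes"])

print(df)
```

Later lessons cover these techniques in more detail; this sketch only shows that a single line of pandas can perform each operation.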

[Figure: The data wrangling process]

Data wrangling vs. other related concepts

Let’s see how data wrangling differs from related concepts such as data cleaning, data mining, data visualization, data analysis, and machine learning.

Comparison of Different Data Concepts

  • Data wrangling: Transforming data from a raw format into a meaningful format for further analysis.

  • Data cleaning: One of the many steps undertaken during data wrangling.

  • Data mining: The general process of finding and extracting meaningful patterns from large datasets using various algorithms and techniques. We perform data wrangling during the data preparation step of data mining.

  • Data visualization: The process of representing data using visual elements, such as charts. Before performing data visualization, we prepare the data through data wrangling.

  • Data analysis: The process of applying statistical or logical techniques to describe, illustrate, and evaluate data. Data wrangling is a prerequisite step in data analysis.

  • Machine learning: The process of building systems that make predictions using insights from data. We prepare data for training machine learning models using data wrangling techniques.

Data wrangling in the context of machine learning

When working on machine learning problems, we usually use data wrangling techniques to prepare data for consumption by machine learning models. If the data in question is appropriate, then the model can make predictions accurately. If not, then the model tends to make mistakes during prediction.

This process of data preparation for model consumption is called feature engineering. During this process, new data can be created from existing data, while irrelevant data that doesn't contribute to model accuracy can be discarded.
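
As a minimal sketch of this idea, here is a hypothetical feature-engineering step in pandas (the trip records and column names below are made up, not taken from this lesson):

```python
import pandas as pd

# Hypothetical trip records, invented for illustration
trips = pd.DataFrame({
    "distance_km": [12.0, 3.5, 8.0],
    "duration_h":  [0.4, 0.1, 0.25],
    "driver_id":   [101, 102, 103],  # an identifier with no predictive value here
})

# Create new data from existing data
trips["avg_speed_kmh"] = trips["distance_km"] / trips["duration_h"]

# Discard data that doesn't contribute to model accuracy
features = trips.drop(columns=["driver_id"])

print(features)
```

The derived `avg_speed_kmh` column is exactly the kind of new feature a model might learn from, while the raw identifier is dropped.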

Furthermore, we can apply advanced data wrangling techniques that involve machine learning models in preparing data for further use. For example, clustering, classification, and regression models can be used for creating new data that classifies records into different groups.
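
For instance, here is a hedged sketch of that advanced technique, using scikit-learn's KMeans clustering on made-up spend values to generate a new grouping column:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer spend values, invented for illustration
df = pd.DataFrame({"monthly_spend": [10.0, 12.0, 9.5, 80.0, 83.0, 78.5]})

# Use a clustering model to create a new column that groups similar records
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["spend_group"] = model.fit_predict(df[["monthly_spend"]])

print(df)
```

The new `spend_group` column classifies records into different groups, which can then feed further analysis.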

Data wrangling technique: Code example

Deleting records with missing values is a data wrangling technique. Let’s look at the following example to see how we can use Python to delete records with missing values.
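
Since the interactive widget isn't reproduced here, the sketch below mirrors its steps; because the lesson's student.csv file isn't bundled, we first write a small, made-up version of it to disk so the example runs end to end:

```python
import pandas as pd

# student.csv isn't bundled with this sketch, so we first write a small,
# hypothetical version of it to disk purely so the example runs standalone.
with open("student.csv", "w") as f:
    f.write("name,age,grade\nAlice,20,A\nBob,,B\nCarol,22,\nDave,21,C\n")

df = pd.read_csv("student.csv")  # read the student.csv dataset into df
df_cleaned = df.dropna()         # delete records with any missing value
print(df_cleaned.head())         # preview the cleaned dataset
```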

  • First, we import the data manipulation library pandas using import pandas as pd.

  • Next, we read the student.csv dataset using the read_csv() function and store it inside df.

  • Then, we delete records with any missing value by applying the dropna() function to the df DataFrame and save the resulting records inside df_cleaned.

  • Finally, we preview the cleaned dataset using the print() function and df_cleaned.head() to observe whether the changes were applied.


Using very few lines of code, we've performed a data wrangling technique that prepares our dataset for further analysis.

In future lessons, we'll take a closer look at the techniques we've applied in this example, such as importing libraries and reading files.

Data wrangling challenges

Sometimes, challenges arise when performing data wrangling. These challenges include:

  • Data accessibility: Accessing appropriate data to answer specific research questions can be problematic. In particular, this can be an issue if the data in question is sensitive and involves personally identifiable information. In many cases, getting approval from the relevant stakeholders can be lengthy and, as a result, delay the project.
  • Avoiding selection bias: If we have lots of data, deciding which data to work with can be a problem if we don’t have sufficient domain knowledge about the population.
  • Reproducible outcomes: Producing final datasets that other data stakeholders answering the same research question can reproduce is difficult if no documentation exists listing and explaining the data wrangling steps that were taken.
  • Data variability: Some projects require sourcing data from multiple data sources, such as spreadsheets, SQL databases, NoSQL databases, and even hard copy documents. Using these data sources and retrieving data stored in a nonstandard format can take time and further increase project costs.

The concept of tidy data

Many scholars have explored data wrangling in depth and have created approaches to effective data wrangling. One such scholar is Hadley Wickham, who wrote a paper titled “Tidy Data.” The paper proposes a framework that makes it easy to tidy up messy datasets and defines raw data, which it refers to as messy data, as unorganized data that exhibits one or more of the following conditions:

  • Column headers are values, not variable names.

  • Multiple variables are stored in one column.

  • Variables are stored in both rows and columns.

  • Multiple types of observational units are stored in the same table.

  • A single observational unit is stored in multiple tables.

Following Wickham's philosophy, the main goal of data wrangling is to tidy up the data and, as a result, produce a dataset that meets the following criteria:

  • Each observation is in a row

  • Each variable is in a column

  • Each value has its own cell
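
For example, a messy table whose column headers are values rather than variable names (years, in this made-up sketch) can be reshaped into tidy form with pandas:

```python
import pandas as pd

# Messy, made-up data: the column headers 2022 and 2023 are values, not variables
messy = pd.DataFrame({
    "country": ["Kenya", "Ghana"],
    "2022": [100, 80],
    "2023": [120, 90],
})

# Tidy it: one observation per row, one variable per column, one value per cell
tidy = messy.melt(id_vars="country", var_name="year", value_name="sales")

print(tidy)
```

After melting, year becomes an ordinary variable column and each country-year observation occupies its own row.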

This course will strive to achieve tidy datasets that meet the criteria above.