How to Apply Data Wrangling
Explore the essentials of applying data wrangling techniques using Python and other tools. Understand who should use these skills, common methods like cleaning and merging data, and the key steps from discovery to publishing. This lesson equips you with knowledge to prepare reliable datasets for analysis and machine learning.
Who should apply data wrangling?
Anyone working with data to answer business questions should be aware of data wrangling skills and whether they can apply them. This includes relevant stakeholders, such as managers or project owners.
If we aspire to become data analysts, data scientists, data engineers, or machine learning engineers, we must learn how to apply data wrangling skills. This is because data projects require a degree of data manipulation before any analysis is carried out.
For data analysts and data scientists, data wrangling applies when preparing data to create reports. For machine learning engineers, it applies when preparing data to build machine learning models. Finally, for data engineers, it applies during the data transformation stage of creating data pipelines.
Data wrangling tools
This course will teach us how to apply data wrangling techniques using Python, a general-purpose programming language that many data engineers, analysts, and scientists work with.
Apart from Python, many other tools and programming languages support data wrangling. Some of these, such as Talend, Alteryx, and Datameer, are proprietary, while others, such as Data Wrangler and csvkit, are free to download and use.
With the knowledge acquired in the course, we'll be able to wrangle any dataset for data visualization, data analysis, and machine learning. More specifically, we'll be able to use the following Python libraries to perform data wrangling:
pandas: This is a data manipulation library that provides data wrangling functions.
NumPy: This scientific computation library provides functions for handling numerical data.
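As a minimal sketch of how these two libraries work together, the following uses a small hypothetical product table (the column names and values are invented for illustration): pandas standardizes text and fills a missing value, while NumPy applies a vectorized numeric transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with inconsistent casing and a missing price.
df = pd.DataFrame({
    "product": ["Widget", "widget", "Gadget"],
    "price": [9.99, np.nan, 24.50],
})

# pandas: standardize text casing and fill the missing price with the column mean.
df["product"] = df["product"].str.title()
df["price"] = df["price"].fillna(df["price"].mean())

# NumPy: a vectorized numeric operation over the column's underlying array.
df["price_rounded"] = np.round(df["price"].to_numpy())
print(df)
```

The same pattern scales to real datasets: pandas handles the tabular structure, while NumPy supplies fast numerical operations on the columns.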
These are just a few of the many data tools we will work with when applying data wrangling. If we want to work with another language to achieve a transformed dataset, we can also use libraries or packages for manipulating data in that language. For example, if we're working with the R programming language, we can use the tidyr package to prepare data for analysis.
We can also perform data wrangling with the following standalone tools:
Excel spreadsheets: A desktop spreadsheet application for analyzing and manipulating data.
Google Sheets: A cloud-based spreadsheet application for analyzing and manipulating data.
OpenRefine: An advanced data transformation desktop application.
dplyr: An R package that provides data manipulation functions.
Dataprep: A cloud application that lets us visually explore, clean, and prepare data for analysis and machine learning.
Data wrangling techniques
We'll cover the following data wrangling techniques in this course:
Reading data from CSV and Excel files
Performing standardization
Removing syntax errors and irrelevant data
Finding and dealing with duplicates and missing data
Finding and dealing with outliers
Filtering and sorting data
Splitting, merging, and concatenating data
Exporting data
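Several of the techniques above can be combined in a single short pipeline. The sketch below uses an in-memory CSV (the names and scores are invented) to demonstrate reading data, removing duplicates, filtering out invalid records, sorting, and exporting back to CSV text.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
raw = io.StringIO(
    "name,score\n"
    "Ann,90\n"
    "Ann,90\n"   # duplicate row
    "Bob,-5\n"   # out-of-range value
    "Cara,78\n"
)

df = pd.read_csv(raw)                           # reading data
df = df.drop_duplicates()                       # removing duplicates
df = df[df["score"] >= 0]                       # filtering out invalid scores
df = df.sort_values("score", ascending=False)   # sorting
csv_out = df.to_csv(index=False)                # exporting back to CSV text
print(csv_out)
```

In practice, `pd.read_csv` would point at a file path or URL, and `to_csv` would write to a file; `StringIO` just keeps the example self-contained.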
Aside from the techniques outlined above, we can apply many other techniques to a raw dataset. We usually apply the ones most suitable for answering the business question at hand; where and when to use them is left to your judgment.
Steps in data wrangling
When performing data wrangling, it's important to consider the following steps that yield a clean and usable dataset for analysis.
Step 1: Discovery
This first step involves exploring the data to understand its structure and records. This might involve understanding trends, patterns, relationships, and prominent problems, such as outliers and missing data.
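Discovery typically starts with a handful of pandas inspection calls. The sketch below (on an invented dataset) surfaces the structure, data types, missing values, and a suspicious outlier:

```python
import pandas as pd

# Hypothetical dataset to explore.
df = pd.DataFrame({
    "age": [25, 31, None, 120],  # None = missing value, 120 = possible outlier
    "city": ["Lagos", "Lima", "Oslo", "Lima"],
})

print(df.shape)          # number of rows and columns
print(df.dtypes)         # column data types
print(df.isna().sum())   # missing values per column
print(df.describe())     # summary statistics reveal the extreme age of 120
```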
Step 2: Organizing or structuring
The next step involves organizing the data because it's unorganized in its raw format. The goal is to make it easier to interpret and analyze. During this step, data from multiple sources in different formats can be aggregated to form a complete dataset.
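A common structuring pattern is stacking records from multiple sources and joining related columns onto them. The sketch below assumes two hypothetical order extracts and a customer lookup table:

```python
import pandas as pd

# Hypothetical data arriving from two sources, plus a related lookup table.
jan = pd.DataFrame({"order_id": [1, 2], "amount": [50, 75]})
feb = pd.DataFrame({"order_id": [3], "amount": [20]})
customers = pd.DataFrame({"order_id": [1, 2, 3],
                          "customer": ["Ann", "Bob", "Ann"]})

orders = pd.concat([jan, feb], ignore_index=True)  # stack rows from both sources
full = orders.merge(customers, on="order_id")      # join related columns
print(full)
```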
Step 3: Data cleaning
This step involves addressing inherent issues in a dataset, such as missing values, outliers, duplicates, inaccurate data, syntax errors, irrelevant data, and so on.
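Three of those issues, duplicates, missing values, and implausible outliers, can be handled in a few lines of pandas. The height values and plausibility bounds below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with a duplicate, a missing value, and an outlier.
df = pd.DataFrame({"height_cm": [170, 165, 165, np.nan, 999]})

df = df.drop_duplicates()                  # remove the repeated 165
df = df.dropna()                           # drop the missing measurement
df = df[df["height_cm"].between(50, 250)]  # drop the implausible 999
print(df)
```

Whether to drop, fill, or cap such values depends on the business question; dropping is just the simplest option to demonstrate.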
Step 4: Enriching
After data cleaning, we might need more data from external sources to answer our research question. Therefore, we need to incorporate more data into our existing dataset to improve data quality and reliability.
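Enrichment often comes down to a left join against the external source so that every existing row is kept. The city names and population figures below are hypothetical placeholders for real external data:

```python
import pandas as pd

# Existing dataset.
sales = pd.DataFrame({"city": ["Paris", "Tokyo"], "revenue": [100, 250]})

# Hypothetical external source adding context (population in millions).
external = pd.DataFrame({"city": ["Paris", "Tokyo"],
                         "population_m": [2.1, 14.0]})

# A left join keeps every existing row and appends the new columns.
enriched = sales.merge(external, on="city", how="left")
print(enriched)
```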
Step 5: Validating
During this step, we cross-check our data to confirm that it is fit for analysis. We can compare our data with similar data from credible external sources. For example, if we worked with population data for countries, we could compare it with population figures from a reliable source, such as the World Bank Open Data website.
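A simple validation check joins our figures against the reference figures and flags rows that deviate beyond a tolerance. The country codes, populations, and the 5% tolerance below are all invented for illustration:

```python
import pandas as pd

# Our wrangled population figures (hypothetical values).
ours = pd.DataFrame({"country": ["A", "B"], "population": [1000, 5000]})

# Reference figures from a trusted external source (hypothetical values).
reference = pd.DataFrame({"country": ["A", "B"],
                          "population_ref": [1010, 5000]})

check = ours.merge(reference, on="country")
# Flag rows that deviate from the reference by more than 5%.
check["valid"] = (
    (check["population"] - check["population_ref"]).abs()
    / check["population_ref"] <= 0.05
)
print(check)
```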
Step 6: Publishing
Once data has been validated, it can be published for other stakeholders to perform further analysis. This further analysis might even involve exporting the final dataset into a database so that it can be used to create more extensive and complex datasets.
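Publishing might mean exporting to CSV for stakeholders or loading into a database table for downstream analysis. The sketch below uses an in-memory SQLite database and invented figures so it runs anywhere:

```python
import sqlite3
import pandas as pd

# Hypothetical validated dataset ready for publishing.
final = pd.DataFrame({"country": ["A", "B"], "population": [1000, 5000]})

# Export as CSV text (in practice, pass a file path to write to disk).
csv_text = final.to_csv(index=False)

# Publish into a database table so others can query or extend it.
con = sqlite3.connect(":memory:")
final.to_sql("population", con, index=False)
back = pd.read_sql("SELECT * FROM population", con)
print(back)
con.close()
```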
Best practices for data wrangling
Since data wrangling techniques can be implemented in various ways, it’s essential to adhere to best practices. These practices allow us to have final datasets that are reliable, accurate, and reproducible.
- Having domain knowledge: To determine what data is relevant for analysis, data wranglers need to deeply understand the project domain.
- Engaging stakeholders: This helps data wranglers align their work with the research problem and keeps them aware of changing data wrangling needs.
- Providing documentation: To make our work reproducible, we need to list and explain the logical steps taken during data wrangling.
- Adopting appropriate and efficient tools: Aside from adopting data wrangling tools appropriate for handling the data we're working with, we also need to adopt emerging tools that make our work easier through automation.