Data Wrangling

Get introduced to data wrangling in Python: what it is, examples of its techniques, and how it compares to related concepts.

Introduction

Data wrangling, also called data munging (and sometimes loosely referred to as data cleaning or data transformation), is the process of transforming data from its raw format into a meaningful format that can be used for downstream tasks, such as data visualization, data analysis, and machine learning.

Here are some examples of data wrangling:

  • Finding and removing syntax errors in data

  • Finding and handling missing values

  • Finding and handling outliers

  • Removing irrelevant data

  • Merging or splitting columns
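To make a few of these examples concrete, here is a minimal sketch using pandas. The table below is a hypothetical sample (the rent of 5000 is made up to act as an outlier), and the 1.5 × IQR rule is only one of several common ways to flag outliers; the new column names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample table; the 5000 rent is invented to serve as an outlier.
df = pd.DataFrame({
    "FIRST NAME": ["Moritz", "Newton", "Andree", "Reid", "Waldon"],
    "LAST NAME": ["Beedle", "Martensen", "Turri", "Meriel", "Zaniolini"],
    "EMAIL": [
        "mbeedle0@naver.com",
        "nmartensen1@comcast.net",
        "aturri2@sourceforge.net",
        "rmeriel3@pinterest.com",
        "wzaniolini4@artisteer.com",
    ],
    "RENT": [623, 798, 744, 560, 5000],
})

# Merging columns: combine first and last name into one column.
df["FULL NAME"] = df["FIRST NAME"] + " " + df["LAST NAME"]

# Splitting a column: separate the email into user and domain parts.
parts = df["EMAIL"].str.split("@", n=1, expand=True)
df["EMAIL_USER"] = parts[0]
df["EMAIL_DOMAIN"] = parts[1]

# Finding outliers: flag rents outside 1.5 * IQR of the quartiles.
q1 = df["RENT"].quantile(0.25)
q3 = df["RENT"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["RENT"] < q1 - 1.5 * iqr) | (df["RENT"] > q3 + 1.5 * iqr)

# Handling outliers: here we simply drop the flagged rows.
df_no_outliers = df[~is_outlier]
print(df_no_outliers[["FULL NAME", "EMAIL_DOMAIN", "RENT"]])
```

Whether an outlier should be dropped, capped, or kept depends on the research question, so the last step is a choice rather than a rule.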

The data wrangling process

Data wrangling vs. other related concepts

Let’s see how data wrangling differs from data cleaning, data mining, data visualization, data analysis, and machine learning.

Comparison of Different Data Concepts

| Concept | Explanation |
|---|---|
| Data wrangling | Refers to transforming data from a raw format into a meaningful format for further analysis. |
| Data cleaning | One of the many steps undertaken during data wrangling. |
| Data mining | Refers to the general process of finding and extracting meaningful patterns from large datasets using various algorithms and techniques. We perform data wrangling techniques during the data preparation step of data mining. |
| Data visualization | Refers to the process of representing data using visual elements, such as charts. Before performing data visualization, we prepare data through data wrangling. |
| Data analysis | Refers to the process of applying statistical or logical techniques to describe, illustrate, and evaluate data. Data wrangling is a prerequisite step in data analysis. |
| Machine learning | Refers to the process of building systems that make predictions using insights from data. We prepare data for training machine learning models using data wrangling techniques. |

Data wrangling in the context of machine learning

When working on machine learning problems, we usually use data wrangling techniques to prepare data for consumption by machine learning models. If the data is clean and relevant to the problem, the model is more likely to make accurate predictions. If not, the model tends to make mistakes during prediction.

This process of data preparation for model consumption is called feature engineering. During this process, new data can be created from existing data, while irrelevant data that doesn't contribute to model accuracy can be discarded.
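As a rough illustration of both ideas, the sketch below creates a new column from an existing one and discards columns that we assume are irrelevant to a model. The sample rows, the `RENT_ABOVE_MEDIAN` column name, and the choice of which columns count as irrelevant are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample table for illustration.
df = pd.DataFrame({
    "FIRST NAME": ["Moritz", "Newton", "Andree", "Reid"],
    "EMAIL": [
        "mbeedle0@naver.com",
        "nmartensen1@comcast.net",
        "aturri2@sourceforge.net",
        "rmeriel3@pinterest.com",
    ],
    "ROOM": [1, 2, 3, 4],
    "RENT": [623, 798, 744, 560],
})

# Create new data from existing data: is the rent above the median rent?
df["RENT_ABOVE_MEDIAN"] = df["RENT"] > df["RENT"].median()

# Discard data we assume does not help the model (names and emails here).
features = df.drop(columns=["FIRST NAME", "EMAIL"])
print(features)
```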

Furthermore, we can apply advanced data wrangling techniques that involve machine learning models in preparing data for further use. For example, clustering, classification, and regression models can be used for creating new data that classifies records into different groups.
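To show what this could look like, here is a small sketch that groups records by rent using a hand-written one-dimensional k-means. This is a simplified stand-in written for this example, not a library API; in practice, we would more likely use a clustering implementation from a machine learning library. The helper name `kmeans_1d` and the `RENT_GROUP` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Six room/rent pairs used as a small sample.
df = pd.DataFrame({
    "ROOM": [1, 4, 6, 9, 12, 22],
    "RENT": [623, 560, 501, 987, 984, 988],
})

def kmeans_1d(values, k=2, n_iter=20):
    """Simplified 1-D k-means (illustrative helper, not a library function)."""
    values = np.asarray(values, dtype=float)
    # Start from evenly spaced quantiles so the result is deterministic.
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each center to the mean of the values assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

labels, centers = kmeans_1d(df["RENT"], k=2)

# The cluster label becomes a new column that groups the records.
df["RENT_GROUP"] = labels
print(df)
```

Here, the new `RENT_GROUP` column separates lower-rent rooms from higher-rent rooms without us choosing a threshold by hand.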

Data wrangling technique: Code example

Deleting records with missing values is a data wrangling technique. Let’s look at the following example to see how we can use Python to delete records with missing values.

  • Line 1: We first import the data manipulation library pandas, using import pandas as pd.

  • Line 2: We load the students.csv dataset using the read_csv() function and store it inside df.

  • Line 3: We delete records with any missing value by applying the dropna() method to the df DataFrame and save the resulting records inside df_cleaned.

  • Line 4: We preview the clean dataset using the print() function and df_cleaned.head() to observe whether the changes were applied.
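The four steps above can be sketched as follows. To keep the sketch self-contained, it reads a small inline sample instead of the students.csv file, so its line numbers differ from the ones listed above. The blank EMAIL and RENT values are hypothetical and were added only so that dropna() has something to remove. With the actual file, the read step would be pd.read_csv("students.csv").

```python
import io

# Step 1: import the data manipulation library pandas.
import pandas as pd

# Hypothetical inline sample: a few rows in the students.csv layout,
# with two values blanked out for illustration.
SAMPLE_CSV = """FIRST NAME,LAST NAME,EMAIL,GENDER,ROOM,RENT
Moritz,Beedle,mbeedle0@naver.com,Male,1,623
Newton,Martensen,,Male,2,798
Andree,Turri,aturri2@sourceforge.net,Female,3,
Reid,Meriel,rmeriel3@pinterest.com,Male,4,560
"""

# Step 2: load the dataset into a DataFrame.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Step 3: delete every record that has at least one missing value.
df_cleaned = df.dropna()

# Step 4: preview the cleaned dataset.
print(df_cleaned.head())
```

Note that dropna() returns a new DataFrame and leaves df unchanged, which is why the result is stored in df_cleaned.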

students.csv
FIRST NAME,LAST NAME,EMAIL,GENDER,ROOM,RENT
Moritz,Beedle,mbeedle0@naver.com,Male,1,623
Newton,Martensen,nmartensen1@comcast.net,Male,2,798
Andree,Turri,aturri2@sourceforge.net,Female,3,744
Reid,Meriel,rmeriel3@pinterest.com,Male,4,560
Waldon,Zaniolini,wzaniolini4@artisteer.com,Male,5,733
Dillie,Heineking,dheineking5@tripod.com,Male,6,501
Ola,Sussans,osussans6@surveymonkey.com,Female,7,686
Elton,Notti,enotti7@moonfruit.com,Male,8,532
Nadine,Wolfart,nwolfart8@cbsnews.com,Female,9,987
Ddene,Grealey,dgrealey9@e-recht24.de,Female,10,634
Dore,Whapham,dwhaphama@forbes.com,Female,11,517
Rubin,Moens,rmoensb@squarespace.com,Male,12,984
Winifield,Dudeney,wdudeneyc@google.co.uk,Male,13,661
Tremayne,Alyukin,talyukind@opera.com,Male,14,570
Kynthia,Warnock,kwarnocke@dot.gov,Female,15,768
Marshal,Glencros,mglencrosf@dropbox.com,Male,16,905
Vachel,Hymas,vhymasg@seesaa.net,Male,17,961
Quincey,Hayen,qhayenh@multiply.com,Male,18,935
Arnie,McFetrich,amcfetrichi@pcworld.com,Male,19,939
Eran,Sharply,esharplyj@jigsy.com,Female,20,912
Trudey,Matzl,tmatzlk@usgs.gov,Female,21,723
Hilly,Beardall,hbeardalll@archive.org,Male,22,988
Arni,Myles,amylesm@house.gov,Male,23,605
Christabel,Bruyntjes,cbruyntjesn@parallels.com,Female,24,821
Alejandrina,Ebbitt,aebbitto@eepurl.com,Female,25,673
Morrie,Camsey,mcamseyp@wp.com,Male,26,849
Ronni,McKernon,rmckernonq@hostgator.com,Female,27,979
Toni,Berntsson,tberntssonr@tripadvisor.com,Female,28,715
Antoine,Willmetts,awillmettss@1und1.de,Male,29,891
Kat,Johnsee,kjohnseet@over-blog.com,Female,30,716
Iago,Neasam,ineasamu@sciencedaily.com,Male,31,727
Rosanna,Guirardin,rguirardinv@t-online.de,Female,32,996
Ingaborg,Thredder,ithredderw@webnode.com,Female,33,790
Davey,Shaughnessy,dshaughnessyx@dagondesign.com,Male,34,602
Dolley,Hall-Gough,dhallgoughy@barnesandnoble.com,Female,35,759
Harli,Boles,hbolesz@csmonitor.com,Female,36,577
Artemus,Gennrich,agennrich10@lulu.com,Genderqueer,37,788
Bastian,Potbury,bpotbury11@about.com,Agender,38,537
Genny,Grealey,ggrealey12@washington.edu,Female,39,563
Korella,MacDaid,kmacdaid13@weebly.com,Female,40,501
Aurore,Castagnasso,acastagnasso14@nsw.gov.au,Female,41,595
Estele,Vedenyapin,evedenyapin15@mac.com,Female,42,910
Blancha,Hamill,bhamill16@salon.com,Female,43,581
Giana,Charleston,gcharleston17@posterous.com,Polygender,44,874
Pen,Gullane,pgullane18@dailymail.co.uk,Male,45,941
Isabelita,Hartigan,ihartigan19@wp.com,Genderfluid,46,583
Elisabet,Hatchman,ehatchman1a@creativecommons.org,Female,47,553
Wyatan,Cominello,wcominello1b@cnet.com,Male,48,528
Tyson,Myers,tmyers1c@seattletimes.com,Male,49,536
Emlyn,Wones,ewones1d@xing.com,Male,50,815
Caresa,Aingell,caingell1e@ox.ac.uk,Female,51,998
Gusty,Womersley,gwomersley1f@yelp.com,Female,52,599
Oby,Turfin,oturfin1g@uol.com.br,Polygender,53,700
Fayina,Berendsen,fberendsen1h@ask.com,Non-binary,54,800
Esmaria,Ourry,eourry1i@phoca.cz,Female,55,830
Mignonne,Sinkinson,msinkinson1j@people.com.cn,Female,56,551
Forrest,Tomowicz,ftomowicz1k@hubpages.com,Male,57,954
Sumner,Bineham,sbineham1l@salon.com,Male,58,931
Kimberlyn,Tournie,ktournie1m@vkontakte.ru,Female,59,522
Jesus,Batstone,jbatstone1n@nih.gov,Male,60,552
Jon,Derges,jderges1o@chronoengine.com,Male,61,904
Felecia,Connow,fconnow1p@ning.com,Female,62,866
Dode,Segoe,dsegoe1q@vimeo.com,Female,63,873
Linnell,Segeswoeth,lsegeswoeth1r@purevolume.com,Female,64,973
Alidia,Lashbrook,alashbrook1s@phoca.cz,Female,65,921
Geordie,Balcers,gbalcers1t@nifty.com,Male,66,814
Olivia,Pond,opond1u@rakuten.co.jp,Female,67,907
Demetri,MacKibbon,dmackibbon1v@opera.com,Male,68,681
Helyn,Dagleas,hdagleas1w@utexas.edu,Female,69,947
Arvie,Yo,ayo1x@technorati.com,Male,70,941
Marian,Noe,mnoe1y@vistaprint.com,Female,71,663
Patin,Heinzler,pheinzler1z@naver.com,Male,72,659
Thurstan,Woodier,twoodier20@typepad.com,Male,73,891
Mignon,Lymer,mlymer21@alexa.com,Female,74,658
Agustin,Terron,aterron22@abc.net.au,Male,75,573
Kipper,MacMenemy,kmacmenemy23@mlb.com,Male,76,695
Lynda,Hubbock,lhubbock24@gmpg.org,Female,77,781
Jeanelle,Follows,jfollows25@google.ru,Female,78,864
Flemming,Gluyas,fgluyas26@psu.edu,Male,79,722
Mady,Gerriet,mgerriet27@i2i.jp,Female,80,759
Randolph,Cadigan,rcadigan28@loc.gov,Male,81,659
Dunc,Scoggan,dscoggan29@comsenz.com,Male,82,752
Sybyl,Yarrall,syarrall2a@hostgator.com,Female,83,875
Tuckie,Stud,tstud2b@qq.com,Male,84,665
Miran,Vockins,mvockins2c@de.vu,Female,85,746
Lamar,Giuroni,lgiuroni2d@myspace.com,Male,86,757
Barty,Fonzo,bfonzo2e@liveinternet.ru,Male,87,509
Gerald,Smiths,gsmiths2f@dailymail.co.uk,Male,88,786
Holli,Wellesley,hwellesley2g@npr.org,Female,89,547
Eldridge,Kaasman,ekaasman2h@dedecms.com,Male,90,718
Tobe,Fadian,tfadian2i@wordpress.org,Female,91,931
Bob,Girardi,bgirardi2j@senate.gov,Male,92,851
Garrick,Havock,ghavock2k@cbc.ca,Male,93,996
Corabella,Growden,cgrowden2l@imdb.com,Female,94,804
Babita,Erskine,berskine2m@ed.gov,Female,95,891
Vic,Laurie,vlaurie2n@google.ru,Bigender,96,780
Eduard,McWhannel,emcwhannel2o@noaa.gov,Male,97,721
Obie,Peyton,opeyton2p@usnews.com,Male,98,932
Morly,Cohen,mcohen2q@ezinearticles.com,Male,99,743
Haleigh,Casa,hcasa2r@about.me,Male,100,748

Using very few lines of code, we've performed a data wrangling technique that prepares our dataset for further analysis.

In future lessons, we'll learn more about other data wrangling techniques we've applied in this example, such as importing libraries and reading files.

Data wrangling challenges

Sometimes, challenges arise when performing data wrangling. These challenges include:

  • Data accessibility: Accessing appropriate data to answer specific research questions can be problematic. In particular, this can be an issue if the data in question is sensitive and involves personally identifiable information. In many cases, getting approval from the relevant stakeholders can be lengthy and, as a result, delay the project.
  • Avoiding selection bias: If we have lots of data, deciding which data to work with can be a problem if we don’t have sufficient domain knowledge about the population.
  • Reproducible outcomes: Having final datasets that can be reproduced by other data stakeholders answering the same research question can be a challenge if no documentation exists that lists and explains the reasons for undertaking the considered data wrangling steps.
  • Data variability: Some projects require sourcing data from multiple data sources, such as spreadsheets, SQL databases, NoSQL databases, and even hard copy documents. Using these data sources and retrieving data stored in a nonstandard format can take time and further increase project costs.
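
As a small illustration of the data variability challenge, the sketch below combines records from two differently shaped sources: a CSV file and a SQL table. Both sources are simulated in memory, and the table name rooms and its column names are assumptions made for this example.

```python
import io
import sqlite3

import pandas as pd

# Source 1: a CSV file (simulated here with an in-memory buffer).
csv_source = io.StringIO("ROOM,RENT\n1,623\n2,798\n")
from_csv = pd.read_csv(csv_source)

# Source 2: a SQL database (an in-memory SQLite database with a
# hypothetical "rooms" table that uses different column names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (room_no INTEGER, rent INTEGER)")
conn.executemany("INSERT INTO rooms VALUES (?, ?)", [(3, 744), (4, 560)])
from_sql = pd.read_sql_query("SELECT room_no, rent FROM rooms", conn)
conn.close()

# Bring both sources to the same column names before combining them.
from_sql = from_sql.rename(columns={"room_no": "ROOM", "rent": "RENT"})
combined = pd.concat([from_csv, from_sql], ignore_index=True)
print(combined)
```

Even in this tiny case, we had to reconcile column names by hand, which hints at why nonstandard formats add time and cost to a project.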

The concept of tidy data

Many scholars have explored data wrangling in depth and have created approaches to effective data wrangling. One such scholar is Hadley Wickham, whose paper “Tidy Data” proposes a framework that makes it easy to tidy up messy datasets. The paper refers to raw, unorganized data as messy data and describes it as data that exhibits one or more of the following problems:

  • Column headers are values, not variable names.

  • Multiple variables are stored in one column.

  • Variables are stored in both rows and columns.

  • Multiple types of observational units are stored in the same table.

  • A single observational unit is stored in multiple tables.

Using Wickham's philosophy, the main goal of data wrangling is to tidy up the data and, as a result, have a dataset that meets the following criteria:

  • Each observation is in a row.

  • Each variable is in a column.

  • Each value has its own cell.
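
To see how a messy table can be tidied, consider the first problem listed earlier: column headers that are values rather than variable names. The sketch below uses a hypothetical table in which the month names appear as column headers, and reshapes it with the pandas melt() method so that each observation is in a row and each variable is in a column. The data and the MONTH column name are made up for illustration.

```python
import pandas as pd

# Messy: the month names are values, but they are used as column headers.
wide = pd.DataFrame({
    "ROOM": [1, 2],
    "JAN": [623, 798],
    "FEB": [630, 798],
})

# Tidy: one row per (room, month) observation, one column per variable.
tidy = wide.melt(id_vars="ROOM", var_name="MONTH", value_name="RENT")
print(tidy)
```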

This course will strive towards achieving tidy datasets with the outcomes mentioned above.