Data Wrangling
Get introduced to data wrangling in Python: examples, techniques, and how it compares with related concepts.
Introduction
Data wrangling, also called data munging (and sometimes loosely referred to as data cleaning or data transformation), is the process of transforming data from its raw format into a meaningful format that can be used for downstream tasks such as data visualization, data analysis, and machine learning.
Here are some examples of data wrangling:
- Finding and removing syntax errors in data
- Finding and handling missing values
- Finding and handling outliers
- Removing irrelevant data
- Merging or splitting columns
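As a rough illustration of the examples listed above, the following sketch applies each of them to a tiny DataFrame. The records, the column names, and the choices made (median fill, a 1.5 × IQR outlier rule) are assumptions for illustration only, not the lesson's own dataset or prescribed method.

```python
import pandas as pd

# Hypothetical toy records; names and values are made up for illustration.
df = pd.DataFrame({
    "first_name": ["  Ana ", "Ben", "Cy", "Dee", "Eli", "Fay"],
    "last_name": ["Lee", "Ode", "Poe", "Roy", "Sun", "Tam"],
    "rent": [600.0, None, 650.0, 9000.0, 620.0, 640.0],
    "notes": ["n/a"] * 6,
})

# Removing syntax errors: strip stray whitespace from text values.
df["first_name"] = df["first_name"].str.strip()

# Removing irrelevant data: drop a column that carries no information.
df = df.drop(columns=["notes"])

# Handling missing values: fill the missing rent with the median rent.
df["rent"] = df["rent"].fillna(df["rent"].median())

# Merging columns: combine first and last name into one column.
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Handling outliers: keep rents within 1.5 * IQR of the quartiles.
q1, q3 = df["rent"].quantile(0.25), df["rent"].quantile(0.75)
iqr = q3 - q1
df = df[df["rent"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].reset_index(drop=True)

print(df)
```

Whether to fill, drop, or flag a problematic value depends on the dataset and the question being asked; the choices here are only one reasonable option.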
Data wrangling vs. other related concepts
Let’s see how data wrangling differs from data cleaning, data mining, data visualization, data analysis, and machine learning.
Comparison of Different Data Concepts
Concept | Explanation |
Data wrangling | Refers to transforming data from a raw format into a meaningful format for further analysis. |
Data cleaning | One of the many steps undertaken during data wrangling. |
Data mining | Refers to the general process of finding and extracting meaningful patterns from large datasets using various algorithms and techniques. We perform data wrangling techniques during the data preparation step of data mining. |
Data visualization | Refers to the process of representing data using visual elements, such as charts. Before performing data visualization, we prepare data through data wrangling. |
Data analysis | Refers to the process of applying statistical or logical techniques to describe, illustrate, and evaluate data. Data wrangling is a prerequisite step in data analysis. |
Machine learning | Refers to the process of building systems that make predictions using insights from data. We prepare data for training machine learning models using data wrangling techniques. |
Data wrangling in the context of machine learning
When working on machine learning problems, we usually use data wrangling techniques to prepare data for consumption by machine learning models. If the data is clean and relevant to the problem, the model is more likely to make accurate predictions. If not, the model tends to make mistakes during prediction.
A key part of preparing data for model consumption is feature engineering. During this process, new data can be created from existing data, while irrelevant data that doesn't contribute to model accuracy can be discarded.
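The idea of creating new data from existing data and discarding irrelevant data can be sketched as follows. The three records are taken from the first rows of the lesson's student dataset; the derived columns EMAIL_DOMAIN and RENT_ABOVE_MEAN, and the decision to treat the raw EMAIL column as irrelevant, are hypothetical choices made only for this illustration.

```python
import pandas as pd

# Three records with a subset of the student dataset's columns.
df = pd.DataFrame({
    "EMAIL": ["mbeedle0@naver.com", "nmartensen1@comcast.net", "aturri2@sourceforge.net"],
    "GENDER": ["Male", "Male", "Female"],
    "RENT": [623, 798, 744],
})

# New feature derived from an existing column: the domain part of the email.
df["EMAIL_DOMAIN"] = df["EMAIL"].str.split("@").str[1]

# New feature derived from a numeric column: is the rent above the average?
df["RENT_ABOVE_MEAN"] = df["RENT"] > df["RENT"].mean()

# Discard the raw column assumed (for this sketch) to add nothing for a model.
features = df.drop(columns=["EMAIL"])

print(features)
```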
Furthermore, we can apply advanced data wrangling techniques that use machine learning models to prepare data for further use. For example, clustering, classification, and regression models can be used to create new data, such as a label that assigns each record to a group or a predicted value for each record.
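To make the clustering case concrete, here is a minimal sketch that groups records by rent and stores the group as a new column. It uses a hand-written two-group k-means on a single column as a stand-in for a library clustering model; the selection of rent values, the column name RENT_GROUP, and the fixed ten iterations are assumptions for illustration.

```python
import pandas as pd

# A handful of rent values picked from the student dataset for illustration.
rents = pd.Series([501, 517, 532, 560, 912, 935, 961, 987], name="RENT")

# Start with one center at the lowest rent and one at the highest.
centers = [float(rents.min()), float(rents.max())]

for _ in range(10):
    # Assign each rent to the nearest center (0 = lower group, 1 = higher group).
    dist_low = (rents - centers[0]).abs()
    dist_high = (rents - centers[1]).abs()
    labels = (dist_high < dist_low).astype(int)
    # Move each center to the mean of the rents assigned to it.
    centers = [rents[labels == 0].mean(), rents[labels == 1].mean()]

# The cluster label becomes a new column alongside the original data.
df = pd.DataFrame({"RENT": rents, "RENT_GROUP": labels})
print(df)
```

In practice, a dedicated machine learning library would typically replace the loop; the point is only that the model's output is stored as new data.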
Data wrangling technique: Code example
Deleting records with missing values is a data wrangling technique. Let’s look at the following example to see how we can use Python to delete records with missing values.
- Line 1: We first import the data manipulation library pandas, using import pandas as pd.
- Line 2: We import the student.csv dataset using the read_csv() function and store it inside df.
- Line 3: We delete records with any missing value by applying the dropna() function to the df DataFrame and save the resulting records inside df_cleaned.
- Line 4: We preview the clean dataset using the print() function and df_cleaned.head() to observe whether the changes were applied.
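The code listing that these line notes describe does not appear in this text. A minimal sketch consistent with the four described steps might look like the following. To stay self-contained, it reads a small inline sample instead of the file: the header and four rows follow the student.csv layout, with two fields blanked as a hypothetical illustration of missing values. In the lesson's setup, the read_csv() call would receive "student.csv" instead, and the comments here mean the line numbers do not match the notes exactly.

```python
import io

import pandas as pd  # Step 1: import the data manipulation library

# Hypothetical stand-in for student.csv; two fields are blanked on purpose.
sample_csv = io.StringIO(
    "FIRST NAME,LAST NAME,EMAIL,GENDER,ROOM,RENT\n"
    "Moritz,Beedle,mbeedle0@naver.com,Male,1,623\n"
    "Newton,Martensen,,Male,2,798\n"
    "Andree,Turri,aturri2@sourceforge.net,Female,3,\n"
    "Reid,Meriel,rmeriel3@pinterest.com,Male,4,560\n"
)

df = pd.read_csv(sample_csv)   # Step 2: the lesson reads "student.csv" here
df_cleaned = df.dropna()       # Step 3: drop records with any missing value
print(df_cleaned.head())       # Step 4: preview the cleaned records
```

By default, dropna() removes every row that contains at least one missing value, so only complete records remain in df_cleaned.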
FIRST NAME,LAST NAME,EMAIL,GENDER,ROOM,RENT
Moritz,Beedle,mbeedle0@naver.com,Male,1,623
Newton,Martensen,nmartensen1@comcast.net,Male,2,798
Andree,Turri,aturri2@sourceforge.net,Female,3,744
Reid,Meriel,rmeriel3@pinterest.com,Male,4,560
Waldon,Zaniolini,wzaniolini4@artisteer.com,Male,5,733
Dillie,Heineking,dheineking5@tripod.com,Male,6,501
Ola,Sussans,osussans6@surveymonkey.com,Female,7,686
Elton,Notti,enotti7@moonfruit.com,Male,8,532
Nadine,Wolfart,nwolfart8@cbsnews.com,Female,9,987
Ddene,Grealey,dgrealey9@e-recht24.de,Female,10,634
Dore,Whapham,dwhaphama@forbes.com,Female,11,517
Rubin,Moens,rmoensb@squarespace.com,Male,12,984
Winifield,Dudeney,wdudeneyc@google.co.uk,Male,13,661
Tremayne,Alyukin,talyukind@opera.com,Male,14,570
Kynthia,Warnock,kwarnocke@dot.gov,Female,15,768
Marshal,Glencros,mglencrosf@dropbox.com,Male,16,905
Vachel,Hymas,vhymasg@seesaa.net,Male,17,961
Quincey,Hayen,qhayenh@multiply.com,Male,18,935
Arnie,McFetrich,amcfetrichi@pcworld.com,Male,19,939
Eran,Sharply,esharplyj@jigsy.com,Female,20,912
Trudey,Matzl,tmatzlk@usgs.gov,Female,21,723
Hilly,Beardall,hbeardalll@archive.org,Male,22,988
Arni,Myles,amylesm@house.gov,Male,23,605
Christabel,Bruyntjes,cbruyntjesn@parallels.com,Female,24,821
Alejandrina,Ebbitt,aebbitto@eepurl.com,Female,25,673
Morrie,Camsey,mcamseyp@wp.com,Male,26,849
Ronni,McKernon,rmckernonq@hostgator.com,Female,27,979
Toni,Berntsson,tberntssonr@tripadvisor.com,Female,28,715
Antoine,Willmetts,awillmettss@1und1.de,Male,29,891
Kat,Johnsee,kjohnseet@over-blog.com,Female,30,716
Iago,Neasam,ineasamu@sciencedaily.com,Male,31,727
Rosanna,Guirardin,rguirardinv@t-online.de,Female,32,996
Ingaborg,Thredder,ithredderw@webnode.com,Female,33,790
Davey,Shaughnessy,dshaughnessyx@dagondesign.com,Male,34,602
Dolley,Hall-Gough,dhallgoughy@barnesandnoble.com,Female,35,759
Harli,Boles,hbolesz@csmonitor.com,Female,36,577
Artemus,Gennrich,agennrich10@lulu.com,Genderqueer,37,788
Bastian,Potbury,bpotbury11@about.com,Agender,38,537
Genny,Grealey,ggrealey12@washington.edu,Female,39,563
Korella,MacDaid,kmacdaid13@weebly.com,Female,40,501
Aurore,Castagnasso,acastagnasso14@nsw.gov.au,Female,41,595
Estele,Vedenyapin,evedenyapin15@mac.com,Female,42,910
Blancha,Hamill,bhamill16@salon.com,Female,43,581
Giana,Charleston,gcharleston17@posterous.com,Polygender,44,874
Pen,Gullane,pgullane18@dailymail.co.uk,Male,45,941
Isabelita,Hartigan,ihartigan19@wp.com,Genderfluid,46,583
Elisabet,Hatchman,ehatchman1a@creativecommons.org,Female,47,553
Wyatan,Cominello,wcominello1b@cnet.com,Male,48,528
Tyson,Myers,tmyers1c@seattletimes.com,Male,49,536
Emlyn,Wones,ewones1d@xing.com,Male,50,815
Caresa,Aingell,caingell1e@ox.ac.uk,Female,51,998
Gusty,Womersley,gwomersley1f@yelp.com,Female,52,599
Oby,Turfin,oturfin1g@uol.com.br,Polygender,53,700
Fayina,Berendsen,fberendsen1h@ask.com,Non-binary,54,800
Esmaria,Ourry,eourry1i@phoca.cz,Female,55,830
Mignonne,Sinkinson,msinkinson1j@people.com.cn,Female,56,551
Forrest,Tomowicz,ftomowicz1k@hubpages.com,Male,57,954
Sumner,Bineham,sbineham1l@salon.com,Male,58,931
Kimberlyn,Tournie,ktournie1m@vkontakte.ru,Female,59,522
Jesus,Batstone,jbatstone1n@nih.gov,Male,60,552
Jon,Derges,jderges1o@chronoengine.com,Male,61,904
Felecia,Connow,fconnow1p@ning.com,Female,62,866
Dode,Segoe,dsegoe1q@vimeo.com,Female,63,873
Linnell,Segeswoeth,lsegeswoeth1r@purevolume.com,Female,64,973
Alidia,Lashbrook,alashbrook1s@phoca.cz,Female,65,921
Geordie,Balcers,gbalcers1t@nifty.com,Male,66,814
Olivia,Pond,opond1u@rakuten.co.jp,Female,67,907
Demetri,MacKibbon,dmackibbon1v@opera.com,Male,68,681
Helyn,Dagleas,hdagleas1w@utexas.edu,Female,69,947
Arvie,Yo,ayo1x@technorati.com,Male,70,941
Marian,Noe,mnoe1y@vistaprint.com,Female,71,663
Patin,Heinzler,pheinzler1z@naver.com,Male,72,659
Thurstan,Woodier,twoodier20@typepad.com,Male,73,891
Mignon,Lymer,mlymer21@alexa.com,Female,74,658
Agustin,Terron,aterron22@abc.net.au,Male,75,573
Kipper,MacMenemy,kmacmenemy23@mlb.com,Male,76,695
Lynda,Hubbock,lhubbock24@gmpg.org,Female,77,781
Jeanelle,Follows,jfollows25@google.ru,Female,78,864
Flemming,Gluyas,fgluyas26@psu.edu,Male,79,722
Mady,Gerriet,mgerriet27@i2i.jp,Female,80,759
Randolph,Cadigan,rcadigan28@loc.gov,Male,81,659
Dunc,Scoggan,dscoggan29@comsenz.com,Male,82,752
Sybyl,Yarrall,syarrall2a@hostgator.com,Female,83,875
Tuckie,Stud,tstud2b@qq.com,Male,84,665
Miran,Vockins,mvockins2c@de.vu,Female,85,746
Lamar,Giuroni,lgiuroni2d@myspace.com,Male,86,757
Barty,Fonzo,bfonzo2e@liveinternet.ru,Male,87,509
Gerald,Smiths,gsmiths2f@dailymail.co.uk,Male,88,786
Holli,Wellesley,hwellesley2g@npr.org,Female,89,547
Eldridge,Kaasman,ekaasman2h@dedecms.com,Male,90,718
Tobe,Fadian,tfadian2i@wordpress.org,Female,91,931
Bob,Girardi,bgirardi2j@senate.gov,Male,92,851
Garrick,Havock,ghavock2k@cbc.ca,Male,93,996
Corabella,Growden,cgrowden2l@imdb.com,Female,94,804
Babita,Erskine,berskine2m@ed.gov,Female,95,891
Vic,Laurie,vlaurie2n@google.ru,Bigender,96,780
Eduard,McWhannel,emcwhannel2o@noaa.gov,Male,97,721
Obie,Peyton,opeyton2p@usnews.com,Male,98,932
Morly,Cohen,mcohen2q@ezinearticles.com,Male,99,743
Haleigh,Casa,hcasa2r@about.me,Male,100,748
Using very few lines of code, we've performed a data wrangling technique that prepares our dataset for further analysis.
In future lessons, we'll learn more about the other steps used in this example, such as importing libraries and reading files.
Data wrangling challenges
Sometimes, challenges arise when performing data wrangling. These challenges include:
- Data accessibility: Accessing appropriate data to answer specific research questions can be problematic. In particular, this can be an issue if the data in question is sensitive and involves personally identifiable information. In many cases, getting approval from the relevant stakeholders can be lengthy and, as a result, delay the project.
- Avoiding selection bias: If we have lots of data, deciding which data to work with can be a problem if we don’t have sufficient domain knowledge about the population.
- Reproducible outcomes: Other stakeholders answering the same research question may struggle to reproduce the final datasets if no documentation lists the data wrangling steps that were taken and explains the reasons for them.
- Data variability: Some projects require sourcing data from multiple data sources, such as spreadsheets, SQL databases, NoSQL databases, and even hard copy documents. Using these data sources and retrieving data stored in a nonstandard format can take time and further increase project costs.
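The data variability challenge can be illustrated with a small sketch that pulls records from two differently shaped sources, a CSV and a SQL table, and aligns them before combining. Everything here is hypothetical: the inline CSV text, the in-memory SQLite table tenants, and the column names stand in for real sources.

```python
import io
import sqlite3

import pandas as pd

# Source 1: a CSV (inline text stands in for a real file).
csv_source = io.StringIO("name,rent\nAna,600\nBen,650\n")
csv_df = pd.read_csv(csv_source)

# Source 2: a SQL table with different column names (in-memory SQLite).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tenants (full_name TEXT, monthly_rent INTEGER)")
con.executemany("INSERT INTO tenants VALUES (?, ?)", [("Cy", 700), ("Dee", 720)])
sql_df = pd.read_sql_query("SELECT full_name, monthly_rent FROM tenants", con)
con.close()

# Align the column names so both sources share one schema, then combine.
sql_df = sql_df.rename(columns={"full_name": "name", "monthly_rent": "rent"})
combined = pd.concat([csv_df, sql_df], ignore_index=True)

print(combined)
```

Even in this tiny case, the sources disagree on column names; real projects often also have to reconcile types, units, and encodings, which is where the extra time and cost come from.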
The concept of tidy data
Many scholars have explored data wrangling in depth and have created approaches to effective data wrangling. One such scholar is Hadley Wickham, who wrote a paper titled “Tidy Data.” The paper proposes a framework that makes it easy to tidy up messy datasets. It treats raw data as messy data, that is, data that is not organized in a tidy way, and lists the following common problems:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
Using Wickham's philosophy, the main goal of data wrangling is to tidy up the data and, as a result, have a dataset that meets the following criteria:
- Each observation is in a row.
- Each variable is in a column.
- Each value has its own cell.
This course will strive towards achieving tidy datasets with the outcomes mentioned above.
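As a small sketch of moving from messy to tidy data, the example below fixes the first problem in the list above, column headers that are values rather than variable names. The table is hypothetical: the month columns JAN and FEB and their rent values are invented for illustration.

```python
import pandas as pd

# Hypothetical messy table: the month names are values used as column headers.
messy = pd.DataFrame({
    "ROOM": [1, 2],
    "JAN": [623, 798],
    "FEB": [630, 798],
})

# melt() turns the month headers into values of a MONTH variable, so that
# each row is one observation (a room in a month) and each column one variable.
tidy = messy.melt(id_vars="ROOM", var_name="MONTH", value_name="RENT")

print(tidy)
```

After reshaping, each of the four room-month observations sits in its own row, and ROOM, MONTH, and RENT each occupy a single column.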