Raw data is often messy and must be cleaned with preprocessing techniques before it can be used. One of these techniques is handling missing values.
Missing values that are not handled appropriately can lead to inaccurate or biased results. It is therefore important to make the data consistent and reliable before carrying out analysis or modeling.
In this Answer, we’ll be looking at different ways of replacing missing values in data in Julia.
replace!() function

Using the replace!() function, we can replace missing values with a value of our choice, such as zero, the column mean, or the mode, as long as the value can be stored in the column's element type.
Let’s replace the missing values in the marks column of df with zero, as shown in the following code.
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, 20, 19, 16, 15])

replace!(df.marks, missing => 0)  # overwrite missing entries in marks, in place
println(df)
In the example above:
Line 1: We import the DataFrames library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row represents a student’s information. The marks column contains missing values.
Line 7: We use the replace!() function to modify the original DataFrame by replacing the missing values in marks with zero.
Line 8: We print out the updated DataFrame.
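The replace!() paragraph mentions the column mean as a possible fill value. Below is a minimal sketch of that idea. It assumes a plain vector stands in for the marks column and that rounding the mean to an integer is acceptable; both are illustrative choices, not part of the original example.

```julia
using Statistics  # standard library; provides mean

# A plain vector standing in for the marks column (assumption for illustration)
marks = [50, 60, missing, missing, 30]

# skipmissing ignores the missing entries when computing the mean
avg = mean(skipmissing(marks))

# The mean is a Float64, which may not fit an integer column,
# so we round it to an Int before using it as the fill value.
fill_value = round(Int, avg)

replace!(marks, missing => fill_value)
println(marks)
```

Because replace!() works in place, the vector keeps its original element type, which still allows missing even though no missing values remain.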
coalesce() function

The coalesce() function is another way we can replace missing values.
Let’s replace the missing values in the marks and age columns with zero, as shown in the following code.
using Queryverse
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, 20, missing, 18, missing])

df = coalesce.(df, 0)  # broadcast coalesce over every cell of df
println(df)
In the example above:
Line 1: We import the Queryverse library, which makes DataFrame available.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row represents a student’s information. The marks and age columns contain missing values.
Line 7: We broadcast coalesce() over df with the dot syntax to replace the missing values in marks and age with zero. We then update df to the new DataFrame.
Line 8: We print out the new DataFrame.
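To see what coalesce() does on its own, here is a small sketch that uses only Base Julia. The age vector is an illustrative stand-in for a DataFrame column.

```julia
# coalesce returns the first of its arguments that is not missing
a = coalesce(missing, 0)   # missing is skipped, so the fallback 0 is returned
b = coalesce(5, 0)         # 5 is not missing, so it is returned unchanged

# Broadcasting with the dot syntax applies coalesce to each element
age = [15, 20, missing, 18, missing]
filled = coalesce.(age, 0)

println(filled)
println(eltype(filled))
```

Unlike replace!(), the broadcast builds a new vector and leaves age untouched. Since the result contains no missing values, its element type no longer includes Missing.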
mapcols() and coalesce() functions

In the example below, we use mapcols() to apply coalesce() to each column in the DataFrame and replace the missing values with zero.
Note: We can also use mapcols!() instead of mapcols(), which modifies the original DataFrame in place.
using Queryverse
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, missing, 19, missing, 15])

# use mapcols
df = mapcols(x -> coalesce.(x, 0), df)

println(df)
In the example above:
Line 1: We import the Queryverse library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row represents a student’s information. The marks and age columns contain missing values.
Line 8: We use the mapcols() function to apply coalesce() to every column in the DataFrame, which replaces the missing values in marks and age with zero. We then update df to the new DataFrame.
Line 10: We print out the new DataFrame.
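The note about mapcols!() can be sketched as follows. This is a minimal example that assumes the DataFrames package is installed and uses a smaller, made-up table than the one above.

```julia
using DataFrames

# A smaller illustrative table (not the one from the example above)
df = DataFrame(name = ["Amy", "Jane", "John"],
               marks = [50, missing, 30],
               age = [15, missing, 19])

# mapcols! replaces every column of df with the function's result,
# so there is no need to assign the result back to df
mapcols!(col -> coalesce.(col, 0), df)

println(df)
```

The function runs on every column, including name. Columns without missing values come back with the same contents.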
transform!() and coalesce() functions

Similar to the mapcols() function above, we can also use transform!() to apply coalesce() to selected columns in the DataFrame and replace the missing values with zero, as shown in the following code.
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, missing, 19, 18, 15])

transform!(df, [:marks, :age] .=> ByRow(x -> coalesce(x, 0)) .=>
               [:marks, :age])  # write the results back to the same columns
println(df)
In the example above:
Line 1: We import the DataFrames library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row represents a student’s information. The marks and age columns contain missing values.
Lines 7–8: We use the transform!() function to apply coalesce() to the marks and age columns in the DataFrame, which replaces the missing values with zero. transform!() modifies the original DataFrame, so we don’t need to assign the result back to df.
Line 9: We print out the updated DataFrame.
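The transform!() call above repeats the column names as targets so that the results overwrite marks and age. A possible alternative, sketched here on a smaller made-up table and assuming a DataFrames version that supports the renamecols keyword, is to keep the source names automatically:

```julia
using DataFrames

# A smaller illustrative table (not the one from the example above)
df = DataFrame(name = ["Amy", "Jane", "John"],
               marks = [50, missing, 30],
               age = [15, missing, 19])

# renamecols = false keeps the source column names, so the results overwrite
# marks and age instead of being stored under automatically generated names
transform!(df, [:marks, :age] .=> ByRow(x -> coalesce(x, 0)); renamecols = false)

println(df)
```

Either form leaves the other columns untouched. Which one reads better is largely a matter of taste.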