How to replace missing values in Julia DataFrames
Raw data is often messy and needs to be cleaned with various preprocessing techniques before it can be used. One of these techniques is handling missing values.
If missing values are not handled appropriately, they can lead to inaccurate or biased results. It is therefore important to make the data consistent and reliable before carrying out analysis or modeling.
In this Answer, we’ll be looking at different ways of replacing missing values in data in Julia.
1. Using the replace!() function
Using the replace!() function, we can replace missing values with a value of our choice, such as zero or the mean or mode of the column, as long as the value can be stored in the column's element type.
Let’s replace the missing values in the marks column of df with zero, as shown in the following code.
Code example
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, 20, 19, 16, 15])

replace!(df.marks, missing => 0)
println(df)
Code explanation
In the example above:
Line 1: We import the DataFrames library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks column contains missing values.
Line 7: We use the replace!() function to modify the marks column of the original DataFrame in place, replacing its missing values with zero.
Line 8: We print out the updated DataFrame.
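The text above mentions filling with the mean of the column. The following is a minimal sketch of that idea, assuming the Statistics standard library is available. Because the mean of an integer column is usually not an integer, storing it with replace!() in an integer column would fail, so the sketch shows two possible workarounds. The column name marks_mean and the variable avg are illustrative only.

```julia
using DataFrames
using Statistics  # standard library; provides mean

df = DataFrame(student_id = [1, 2, 3, 4, 5],
               marks = [50, 60, missing, missing, 30])

# Mean of the non-missing marks: (50 + 60 + 30) / 3
avg = mean(skipmissing(df.marks))

# Option A: the non-mutating replace returns a new vector whose element
# type is widened to hold the Float64 mean; we store it in a new column.
df.marks_mean = replace(df.marks, missing => avg)

# Option B: keep the integer column and round the mean so that
# replace! can store it in place.
replace!(df.marks, missing => round(Int, avg))

println(df)
```

Which option fits depends on whether the column should stay integer-valued after filling.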
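2. Using the coalesce() function
The coalesce() function is another way we can replace missing values. It returns the first of its arguments that is not missing, so coalesce(x, 0) gives x unless x is missing, in which case it gives 0.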
Let’s replace the missing values in the marks and age columns with zero as shown in the following code.
Code example
using Queryverse
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, 20, missing, 18, missing])

df = coalesce.(df, 0)
println(df)
Code explanation
In the example above:
Line 1: We import the Queryverse library, which makes DataFrame available.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.
Line 7: We broadcast coalesce() over every value in the DataFrame using the dot syntax, which replaces the missing values in marks and age with zero. We then assign the resulting DataFrame back to df.
Line 8: We print out the new DataFrame.
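Broadcasting over the whole DataFrame fills missing values in every column, including text columns. As a hedged sketch, the example below shows how coalesce() behaves on single values and how it could be applied to one column only. The small DataFrame here is made up for illustration and uses DataFrames directly.

```julia
using DataFrames

# coalesce returns its first argument that is not missing
println(coalesce(missing, 0))   # prints 0
println(coalesce(5, 0))         # prints 5

df = DataFrame(name = ["Amy", missing, "John"],
               marks = [50, missing, 30])

# Replace missing values in the marks column only;
# the name column keeps its missing entry.
df.marks = coalesce.(df.marks, 0)
println(df)
```

Limiting the call to one column avoids writing a numeric zero into a column of names.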
3. Using the mapcols() and coalesce() functions
In the example below, we use mapcols() to apply coalesce() to each column in the DataFrame and replace the missing values with zero.
Note: We can also use mapcols!() instead of mapcols(), which modifies the original DataFrame in place.
Code example
using Queryverse
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, missing, 19, missing, 15])

# use mapcols
df = mapcols(x -> coalesce.(x, 0), df)

println(df)
Code explanation
In the example above:
Line 1: We import the Queryverse library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.
Line 8: We use the mapcols() function to apply coalesce() to every column in the DataFrame, which replaces the missing values in marks and age with zero. We then assign the resulting DataFrame back to df.
Line 10: We print out the new DataFrame.
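The in-place variant mentioned in the note can be sketched as follows. This is a minimal illustration that uses DataFrames directly and a smaller, made-up DataFrame.

```julia
using DataFrames

df = DataFrame(marks = [50, 60, missing, missing, 30],
               age = [15, missing, 19, missing, 15])

# mapcols! replaces every column of df with the result of the function,
# so no reassignment to df is needed.
mapcols!(col -> coalesce.(col, 0), df)
println(df)
```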
4. Using the transform!() and coalesce() functions
Similar to the mapcols() approach above, we can use transform!() with coalesce() to replace missing values with zero. Unlike mapcols(), transform!() lets us name the columns to process, here marks and age, as shown in the following code.
Code example
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, missing, missing, 30],
               age = [15, missing, 19, 18, 15])

transform!(df,
           [:marks, :age] .=> ByRow(x -> coalesce(x, 0)) .=> [:marks, :age])
println(df)
Code explanation
In the example above:
Line 1: We import the DataFrames library.
Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.
Lines 7–8: We use the transform!() function to apply coalesce() to each value in the marks and age columns, which replaces the missing values with zero. Because transform!() modifies the original DataFrame, we don’t need to reassign the result to df.
Line 9: We print out the updated DataFrame.
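Because transform!() takes one source => function => destination pair per column, each column can be given its own fill value. The sketch below assumes that is what we want. The fill values 0 and 18 and the small DataFrame are arbitrary placeholders, not part of the original example.

```julia
using DataFrames

df = DataFrame(name = ["Amy", "Jane", "John"],
               marks = [50, missing, 30],
               age = [15, missing, 19])

# Each pair is source column => function => destination column.
# The fill values 0 and 18 are arbitrary placeholders for this sketch.
transform!(df,
           :marks => ByRow(x -> coalesce(x, 0)) => :marks,
           :age => ByRow(x -> coalesce(x, 18)) => :age)
println(df)
```

Columns that are not listed, such as name, are left unchanged.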