How to replace missing values in Julia DataFrames

Raw data is often messy and needs to be cleaned with preprocessing techniques before it can be analyzed. One of these techniques is handling missing values.

Missing values that are not handled appropriately can lead to inaccurate or biased results, so the data should be made consistent and reliable before any analysis or modeling.

In this Answer, we’ll look at different ways of replacing missing values in a Julia DataFrame.

1. Using the replace!() function

Using the replace!() function, we can replace missing values with a value of our choice, such as zero, the column mean, or the mode, as long as the value can be stored in the column’s element type (for example, an integer in a column of integers).

Let’s replace the missing values in the marks column of df with zero, as shown in the following code.

Code example

using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,missing,missing,30],
               age=[15,20,19,16,15])
replace!(df.marks, missing => 0)
println(df)

Code explanation

In the example above:

Line 1: We import the DataFrames library.

Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks column contains missing values.

Line 6: We use the replace!() function to modify the original DataFrame in place, replacing the missing values in marks with zero.

Line 7: We print the updated DataFrame.
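
The same call can fill with the column mean instead of zero. The following is a minimal sketch of that idea, assuming the Statistics standard library and assuming that rounding the mean to a whole number is acceptable for an integer marks column. The names avg and fill_value are illustrative and not part of the original example.

```julia
using DataFrames
using Statistics  # standard library; provides mean

df = DataFrame(student_id=[1,2,3,4,5],
               marks=[50,60,missing,missing,30])

# skipmissing ignores the missing entries when computing the mean
avg = mean(skipmissing(df.marks))

# marks holds integers, so round the mean before storing it;
# a fractional value would not fit in the column (assumption: rounding is acceptable)
fill_value = round(Int, avg)

replace!(df.marks, missing => fill_value)
println(df)
```

If fractional marks matter, the column would first have to be converted to a floating-point type; the rounding here is only one possible choice.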

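2. Using the coalesce() function

The coalesce() function is another way to replace missing values. It returns the first of its arguments that is not missing, so coalesce(x, 0) gives x when x is present and 0 when x is missing.

Let’s replace the missing values in the marks and age columns with zero, as shown in the following code.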

Code example

using Queryverse
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,missing,missing,30],
               age=[15,20,missing,18,missing])
df = coalesce.(df, 0)
println(df)

Code explanation

In the example above:

Line 1: We import the Queryverse library, which also makes DataFrame available.

Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.

Line 6: We broadcast coalesce() over the whole DataFrame with the dot syntax, which replaces every missing value (here, those in marks and age) with zero. We then assign the resulting DataFrame back to df.

Line 7: We print the new DataFrame.
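
To make the behavior of coalesce() concrete, here is a small sketch that uses only base Julia on a plain vector. The variable names marks and filled are illustrative.

```julia
# coalesce returns its first argument that is not missing
println(coalesce(missing, 0))   # first argument is missing, so the fallback is returned
println(coalesce(50, 0))        # 50 is not missing, so it is returned unchanged

# the dot broadcasts coalesce over each element of a vector
marks = [50, 60, missing, missing, 30]
filled = coalesce.(marks, 0)
println(filled)

# broadcasting builds a new vector; the original still has its missing values
println(any(ismissing, marks))
```

Because coalesce.(df, 0) touches every cell, a missing value in a text column such as name would also become the number 0. Applying it to a single column, for example df.marks = coalesce.(df.marks, 0), avoids that.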

3. Using the mapcols() and coalesce() functions

In the example below, we use mapcols() to apply coalesce() to each column in the DataFrame and replace the missing values with zero.

Note: We can also use mapcols!() instead of mapcols(), which modifies the original DataFrame in place.

Code example

using Queryverse
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,missing,missing,30],
               age=[15,missing,19,missing,15])
# use mapcols to apply coalesce to every column
df = mapcols(x -> coalesce.(x, 0), df)
println(df)

Code explanation

In the example above:

Line 1: We import the Queryverse library, which also makes DataFrame available.

Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.

Line 7: We use the mapcols() function to apply coalesce() to every column in the DataFrame, which replaces the missing values in marks and age with zero. We then assign the resulting DataFrame back to df.

Line 8: We print the new DataFrame.
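
The in-place variant mentioned in the note can be sketched as follows. This assumes the DataFrames package is loaded directly; the name column is left out to keep the example short.

```julia
using DataFrames

df = DataFrame(student_id=[1,2,3,4,5],
               marks=[50,60,missing,missing,30],
               age=[15,missing,19,missing,15])

# mapcols! replaces each column of df with the result of the function,
# so no reassignment to df is needed
mapcols!(col -> coalesce.(col, 0), df)
println(df)
```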

4. Using the transform!() and coalesce() functions

Similar to the mapcols() function above, we can use transform!() to apply coalesce() to selected columns in the DataFrame and replace their missing values with zero, as shown in the following code.

Code example

using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,missing,missing,30],
               age=[15,missing,19,18,15])
transform!(df, [:marks, :age] .=> ByRow(x ->
    coalesce(x, 0)) .=> [:marks, :age])
println(df)

Code explanation

In the example above:

Line 1: We import the DataFrames library.

Lines 2–5: We create a DataFrame with 4 columns and 5 rows. Each row holds one student’s information. The marks and age columns contain missing values.

Lines 6–7: We use the transform!() function to apply coalesce() row by row to the marks and age columns, which replaces their missing values with zero. transform!() modifies the original DataFrame, so we don’t need to assign the result back to df.

Line 8: We print the updated DataFrame.
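
For comparison, here is a sketch of the non-mutating transform() applied to the same columns. It returns a new DataFrame and leaves the original untouched. The names cols, fill_zero, and df_filled are illustrative and not part of the original example.

```julia
using DataFrames

df = DataFrame(student_id=[1,2,3,4,5],
               marks=[50,60,missing,missing,30],
               age=[15,missing,19,18,15])

cols = [:marks, :age]                    # columns to fill
fill_zero = ByRow(x -> coalesce(x, 0))   # applied to each value in a column

# transform (without !) returns a new DataFrame and leaves df as it was
df_filled = transform(df, cols .=> fill_zero .=> cols)

println(df_filled)
println(any(ismissing, df.marks))  # checks whether df still holds missing values
```

Choosing between the two is mostly a question of whether the original data should be preserved for later steps.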

Copyright ©2024 Educative, Inc. All rights reserved