Julia is a powerful data science language known for its robust capabilities in numerical computations.
Working with DataFrames inevitably involves data manipulation, and one common task is removing columns that aren’t needed in the analysis.
There are several ways of removing columns by name in a Julia DataFrame, and in this Answer, we will review a few ways to do so.
The select!() function
We can use the select!() function together with the Not selector to specify which columns to remove. Not selects all columns except the ones specified. Because select!() modifies the original DataFrame directly, this is referred to as modification in place.
Here’s the syntax for using the select!() function:
select!(df, Not([:A, :B]))
Alternatively, select() can be used. It leaves the original DataFrame unchanged and returns a new DataFrame without the specified columns, so the result needs to be assigned to a variable.
We can use the select() function in the following way:
df = select(df, Not([:A, :B]))
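To make the difference between the two functions concrete, here is a minimal sketch, assuming a small hypothetical DataFrame with columns A, B, and C (the data is made up for illustration). It shows that select() leaves its input untouched, while select!() changes it.

```julia
using DataFrames

# Hypothetical data, used only for illustration.
df = DataFrame(A = [1, 2], B = [3, 4], C = [5, 6])

# select() returns a new DataFrame; df itself keeps all three columns.
trimmed = select(df, Not([:A, :B]))
println(names(trimmed))  # column names of the new DataFrame
println(names(df))       # column names of the original

# select!() drops the columns from df itself.
select!(df, Not([:A, :B]))
println(names(df))       # column names of df after the in-place change
```

If the original data is still needed later in the analysis, select() is the safer choice. select!() avoids keeping a second DataFrame around.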
Let’s understand the select!() function using the code example below.
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
select!(df, Not([:name, :age]))
println(df)
Let’s explain the code provided above.
Line 1: We load the DataFrames package.
Lines 2–5: We create a DataFrame with four columns, namely student_id, name, marks, and age, and five rows, where each row holds one student’s information.
Line 6: We use select!() to keep all columns in the DataFrame except name and age.
Line 7: We print out the modified DataFrame.
select!() and setdiff()
The select!() and setdiff() functions can be combined in the following way:
select!(df, Not(setdiff(names(df), ["A", "B"])))
The setdiff() function returns the set difference between two arrays. Because names(df) returns the column names as strings, the names to keep are written as strings as well (comparing against the symbols :A and :B would remove nothing). So, setdiff(names(df), ["A", "B"]) returns the names of all columns in df except for A and B. The Not selector negates the selection, so the example above keeps only columns A and B. Finally, select!() modifies the original DataFrame in place.
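The intermediate values can be inspected step by step. The following is a minimal sketch, assuming a hypothetical DataFrame with columns A, B, and C. It relies on names(df) returning the column names as strings, so the names passed to setdiff() are written as strings too.

```julia
using DataFrames

# Hypothetical data, used only for illustration.
df = DataFrame(A = [1, 2], B = [3, 4], C = [5, 6])

# names(df) returns the column names as strings.
all_names = names(df)

# Every column name except A and B; this is what Not() will exclude.
to_drop = setdiff(all_names, ["A", "B"])
println(to_drop)

# Not(to_drop) selects the complement of to_drop, which is A and B.
select!(df, Not(to_drop))
println(names(df))
```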
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
select!(df, Not(setdiff(names(df), ["name", "age"])))
println(df)
In the code above:
Line 6: We use select!() and setdiff(). The setdiff() call collects all columns in the DataFrame except name and age. However, Not negates this selection, so name and age are the only columns kept.
Not()
We can also use Not() directly when indexing the DataFrame. In this case, Not() excludes the columns we don’t need, and the indexing expression returns a new DataFrame containing the remaining columns, as shown below:
using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
df = df[:, Not([:name, :age])]
println(df)
In the code above:
Line 6: We reassign df to a new DataFrame built from the original one, selecting all columns except name and age using Not().
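As a small illustration of what the indexing form returns, the sketch below assumes a hypothetical DataFrame with columns A, B, and C, and stores the result under a new name (kept, chosen only for this example) instead of reassigning df. Indexing with : as the row selector copies the selected columns, so the original DataFrame is not affected by later changes to the result.

```julia
using DataFrames

# Hypothetical data, used only for illustration.
df = DataFrame(A = [1, 2], B = [3, 4], C = [5, 6])

# The row selector `:` copies the selected columns into a new DataFrame.
kept = df[:, Not([:A, :B])]

# Changing the copy does not affect the original DataFrame.
kept.C[1] = 100
println(kept.C)
println(df.C)
println(names(df))
```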