How to remove columns by name in a Julia DataFrame

Julia is a high-performance language widely used for numerical computing and data science.

Data manipulation is inevitable when working with DataFrames. One common task is removing columns that aren’t needed in the analysis.

There are several ways of removing columns by name in a Julia DataFrame, and in this Answer, we will review a few ways to do so.

Method 1: Using the select!() function

We can combine the select!() function with the Not() selector to specify which columns to remove. Not() selects every column except the ones passed to it. Because select!() ends with an exclamation mark, it modifies the original DataFrame directly. This is referred to as modification in place.

Here’s the syntax for using the select!() method:

select!(df,Not([:A,:B]))

The non-mutating select() function can be used as well. It leaves the original DataFrame untouched and returns a new DataFrame without the excluded columns, so the result must be assigned to a variable.

We can use the select() method in the following way:

df = select(df,Not([:A,:B]))

Let’s understand the select!() method using the code example below.

Example

using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
select!(df, Not([:name, :age]))
println(df)

Let’s explain the code provided above.

  • Line 1: We load the DataFrames package.

  • Lines 2–5: We create a DataFrame with four columns, namely student_id, name, marks, and age, and five rows, where each row holds one student’s information.

  • Line 6: We use select!() to select all columns in the DataFrame except name and age.

  • Line 7: We print out the modified DataFrame.
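
The difference between select() and select!() described earlier can be seen in a minimal sketch like the one below. The three-row DataFrame and the variable name trimmed are illustrative assumptions, not part of the example above.

```julia
using DataFrames

# Small illustrative DataFrame (hypothetical data)
df = DataFrame(student_id = [1, 2, 3],
               name = ["Amy", "Jane", "John"],
               marks = [50, 60, 40])

# select() returns a new DataFrame; df keeps all three columns
trimmed = select(df, Not(:name))
println(names(trimmed))  # columns: student_id, marks
println(names(df))       # columns: student_id, name, marks

# select!() removes the column from df itself
select!(df, Not(:name))
println(names(df))       # columns: student_id, marks
```

Using select() is usually the safer choice when the original DataFrame is still needed later in the analysis.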

Method 2: Using select!() and setdiff()

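The select!() and setdiff() functions can be combined in the following way:

select!(df, Not(setdiff(names(df), ["A", "B"])))

The setdiff() function returns the set difference between two arrays. Because names(df) returns the column names as strings, the names to exclude must also be given as strings, and setdiff(names(df), ["A", "B"]) returns the names of all columns in df except A and B. Not() then negates that selection, so the call above keeps only columns A and B and removes every other column. Finally, select!() applies the change to the original DataFrame. To remove A and B instead, pass the setdiff() result to select!() directly, without Not().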

Example

using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
select!(df, Not(setdiff(names(df), ["name", "age"])))
println(df)

In the code above:

  • Line 6: setdiff() collects every column name except name and age, that is, student_id and marks. Not() negates this selection, so select!() removes student_id and marks, and name and age are the only columns left in the DataFrame.
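
Since names(df) returns strings, the type of the names passed to setdiff() matters. The sketch below, using hypothetical variable names, shows that symbols are not matched against those strings, and how the setdiff() result can be used either to drop or to keep the listed columns.

```julia
using DataFrames

# Small illustrative DataFrame (hypothetical data)
df = DataFrame(student_id = [1, 2, 3],
               name = ["Amy", "Jane", "John"],
               marks = [50, 60, 40],
               age = [15, 16, 19])

# Strings match the strings returned by names(df)
remaining = setdiff(names(df), ["name", "age"])   # student_id and marks

# Symbols never equal strings, so no name is removed here
unchanged = setdiff(names(df), [:name, :age])     # all four names

# Passing the result directly drops name and age
dropped = select(df, remaining)

# Wrapping it in Not() keeps only name and age
kept_only = select(df, Not(remaining))
```

If the goal is simply to drop name and age, select(df, Not([:name, :age])) from Method 1 achieves the same as the dropped line with less code.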

Method 3: Indexing with Not()

We can also use Not() directly when indexing the DataFrame. In this case, Not() excludes the columns we don’t need, and the indexing expression returns a new DataFrame with the remaining columns, as shown below:

Example

using DataFrames
df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])
df = df[:, Not([:name, :age])]
println(df)

In the code above:

  • Line 6: We index the DataFrame with Not() to select all columns except name and age, and reassign the resulting new DataFrame to df.
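
As a minimal sketch of how this indexing behaves, the example below stores the result under new, hypothetical variable names so the original DataFrame can be inspected afterward. It also shows that Not() accepts a single column name, given as a symbol or a string.

```julia
using DataFrames

# Small illustrative DataFrame (hypothetical data)
df = DataFrame(student_id = [1, 2, 3],
               name = ["Amy", "Jane", "John"],
               marks = [50, 60, 40],
               age = [15, 16, 19])

# Indexing rows with : copies the selected columns into a new DataFrame
trimmed = df[:, Not([:name, :age])]

# Not() also accepts a single column name
no_age = df[:, Not(:age)]
no_name = df[:, Not("name")]

# Changing the copy does not affect the original DataFrame
trimmed.marks[1] = 0
println(df.marks[1])  # still 50
println(names(df))    # df still has all four columns
```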
Copyright ©2024 Educative, Inc. All rights reserved