How to select a subset of DataFrame columns in Julia

A DataFrame is a tabular data structure that organizes data into named columns and rows, which makes the data easy to analyze and manipulate. In Julia, DataFrames are provided by the DataFrames.jl package.

Julia offers several ways to select a subset of DataFrame columns. This Answer covers four of them.

Method 1: Using column names

We can select a subset of columns using their actual column names, as shown below:

df = df[:, [:A, :B]]

The code above selects the columns named A and B from df and rebinds df to the result. Column names can be written as symbols (:A) or as strings ("A").

Example

using DataFrames
df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])
df = df[:, [:name, :age]]  # keep only the name and age columns
println(df)

Explanation

Let’s explain the code provided above.

  • Line 1: We load the DataFrames package.

  • Lines 2–5: We create a DataFrame with four columns and five rows, where each row holds one student's information.

  • Line 6: We select only the name and age columns and rebind the variable df to the resulting DataFrame.

  • Line 7: We print the new DataFrame.
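
As a complementary sketch, the snippet below reuses the student data to show two details of name-based indexing in DataFrames.jl: symbols and strings both work as column names, and the row selector decides whether the selected columns are copied. The variable names by_symbol, by_string, and shared are introduced here for illustration only.

```julia
using DataFrames

df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])

# Symbols and strings are interchangeable as column names.
by_symbol = df[:, [:name, :age]]
by_string = df[:, ["name", "age"]]

# `:` as the row selector copies the selected columns;
# `!` returns a new DataFrame that shares its column vectors with df.
shared = df[!, [:name, :age]]

println(names(by_symbol))            # column names of the subset
println(by_symbol == by_string)      # do both spellings select the same data?
println(shared.name === df.name)     # is the column shared with df?
println(by_symbol.name === df.name)  # is the copied column the same object?
```

Copying with `:` is the safer default because later changes to the subset cannot affect df. The `!` form avoids the copy, which may matter for large tables.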

Method 2: Using column index

We can select a subset of columns by specifying their index numbers. Here’s an example:

df = df[:, [1, 3]]

The code above selects the first and third columns of df. Julia indexes from 1, so index 1 refers to the first column. The resulting DataFrame contains only the selected columns.

Example

using DataFrames
df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])
df = df[:, [1, 3]]  # keep the first and third columns
println(df)

Explanation

  • Line 6: We select the columns at index 1 (student_id) and index 3 (marks), which returns a new DataFrame with only these columns. We rebind the variable df to this new DataFrame.
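
Index-based selection also accepts ranges, which can be handy for contiguous blocks of columns. The following is a minimal sketch on the same student data. The names middle and last_two are hypothetical, and ncol() is the DataFrames.jl function that returns the number of columns.

```julia
using DataFrames

df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])

# Print each column's position next to its name to check the mapping.
for (i, colname) in enumerate(names(df))
    println(i, " => ", colname)
end

# A range selects a contiguous block of columns (here columns 2 through 3).
middle = df[:, 2:3]

# Compute the range from the column count to keep the final two columns.
last_two = df[:, (ncol(df) - 1):ncol(df)]

println(names(middle))
println(names(last_two))
```

Positions depend on column order, so code that selects by index may silently pick different columns if the table layout changes. Selecting by name avoids that risk.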

Method 3: Using select() or select!()

We can also use the select() or select!() function to select a subset of DataFrame columns, as explained below.

Option 1

select!(df, [:A, :B])

The select!() function keeps only the columns A and B and modifies the original DataFrame, df. This is referred to as modifying in place.

Option 2

df = select(df, [:A, :B])

The select() function selects columns A and B and returns them in a new DataFrame, leaving the original unchanged. Here, we rebind df to the result. To keep the original DataFrame available, assign the result to a different variable instead.

Example

using DataFrames
df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])
# Using select: returns a new DataFrame
df1 = select(df, [:student_id, :marks])
println(df1)
println("-------------------------")
# Using select!: modifies df in place
select!(df, [:name, :age])
println(df)

Explanation

Let’s explain the code provided above.

  • Lines 7–8: We use select() to subset the columns, assign the new DataFrame to a variable named df1, and print df1.

  • Lines 11–12: We use select!() to select the columns name and age. Because select!() modifies the original DataFrame, df, no variable assignment is needed. We then print the modified df.
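
Besides a vector of names, select() and select!() accept other column selectors. The sketch below illustrates three that DataFrames.jl exports or supports: Not(), Between(), and a regular expression. The result variable names are made up for this illustration.

```julia
using DataFrames

df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])

# Not(...) keeps every column except the ones listed.
without_age = select(df, Not(:age))

# Between(first, last) keeps a contiguous run of columns, endpoints included.
name_to_marks = select(df, Between(:name, :marks))

# A regular expression keeps the columns whose names match it.
starts_with_ma = select(df, r"^ma")

# select() leaves df untouched, so all columns are still present here.
println(names(df))

# select!() shrinks df itself.
select!(df, Not(:student_id))
println(names(df))
```

Not() tends to be the most convenient choice when dropping one or two columns from a wide table, since the columns to keep do not have to be listed.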

Method 4: Using boolean indexing

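We can use Boolean indexing, where we pass a vector of true and false values with one entry per column, to subset the columns of a DataFrame.

df = df[:, [true, false]]

The code above assumes that df has exactly two columns and keeps only the first one. The Boolean vector must contain one value for every column in the DataFrame.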

Example

using DataFrames
df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])
df = df[:, [true, false, true, true]]  # drop the second column (name)
println(df)

Explanation

  • Line 6: We use Boolean indexing to select three columns, where true keeps a column and false omits it. Consequently, we keep only the columns student_id, marks, and age. We rebind the variable df to the resulting DataFrame.
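
In practice, a Boolean mask is usually computed rather than typed by hand. The following sketch builds one from the element type of each column, keeping only numeric columns. The names is_numeric and numeric_only are assumptions made for this illustration.

```julia
using DataFrames

df = DataFrame(student_id=[1, 2, 3, 4, 5],
               name=["Amy", "Jane", "John", "Nancy", "Peter"],
               marks=[50, 60, 40, 47, 30],
               age=[15, 16, 19, 18, 15])

# One Bool per column: true where the column's element type is numeric.
is_numeric = [eltype(col) <: Number for col in eachcol(df)]

# The mask needs exactly as many entries as df has columns.
@assert length(is_numeric) == ncol(df)

numeric_only = df[:, is_numeric]
println(names(numeric_only))
```

For this data, the computed mask matches the hand-written one in the example above, because name is the only non-numeric column.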

Copyright ©2024 Educative, Inc. All rights reserved