How to select a subset of DataFrame columns in Julia
A DataFrame is one of the most popular data structures for manipulating tabular data. When we read data into a DataFrame, it is organized into rows and columns, which makes it easy to analyze and work with.
In Julia, the DataFrames.jl package offers several ways to select only a subset of DataFrame columns, which we cover in this Answer.
Method 1: Using column names
We can select a subset of columns using their actual column names, as shown below:
df = df[:,[:A,:B]]
The above code selects the columns named A and B from df and assigns the result back to df. Column names can be written as symbols (:A) or as strings ("A").
Example
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:,[:name,:age]]
println(df)
Explanation
Let’s explain the code provided above.
- Line 1: We load the DataFrames package.
- Lines 2–5: We create a DataFrame consisting of four columns and five rows, where each row holds one student's information.
- Line 7: We select only the name and age columns and assign the resulting DataFrame back to df.
- Line 8: We print the new DataFrame.
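As a small variation on the example above, the following sketch shows that column names can be passed either as symbols or as strings, and contrasts `:` with `!` as the row selector. The variable names `by_symbol`, `by_string`, and `no_copy` are illustrative only.

```julia
using DataFrames

df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])

# Column names may be given as symbols or as strings
by_symbol = df[:, [:name, :age]]
by_string = df[:, ["name", "age"]]

# `:` copies the selected columns; `!` reuses the original column vectors
no_copy = df[!, [:name, :age]]

println(by_symbol == by_string)      # both selections hold the same data
println(no_copy.age === df.age)      # checks whether the column vector is shared
println(by_symbol.age === df.age)    # checks whether the copied column is shared
```

Using `:` is the safer default, because later changes to the subset do not affect the original DataFrame.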
Method 2: Using column index
We can select a subset of columns by specifying their index numbers. Here’s an example:
df = df[:,[1,3]]
The code df = df[:, [1, 3]] selects the first and third columns of df (Julia indices start at 1). The resulting DataFrame contains only those columns.
Example
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:,[1,3]]
println(df)
Explanation
- Line 7: We select the columns at index 1 (student_id) and index 3 (marks), which returns a new DataFrame with only these columns. We assign this DataFrame back to df.
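Index-based selection also accepts ranges, which can be handy for consecutive columns. The sketch below is a minimal illustration that reuses the same student data; `first_two` and `first_last` are hypothetical names chosen for this example.

```julia
using DataFrames

df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])

# A range selects consecutive columns (indices are 1-based)
first_two = df[:, 1:2]

# ncol(df) gives the number of columns, so this picks the first and last column
first_last = df[:, [1, ncol(df)]]

println(names(first_two))
println(names(first_last))
```

Positional selection depends on column order, so selecting by name is usually more robust when the layout of the data may change.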
Method 3: Using select() or select!()
We can also use the select() or select!() function to select a subset of DataFrame columns, as explained below.
Option 1
select!(df, [:A, :B])
The select!() function keeps only the columns A and B and modifies the original DataFrame, df. This is referred to as modifying in place.
Option 2
df = select(df,[:A,:B])
The select() function selects columns A and B and returns a new DataFrame, leaving the original unchanged. Here, we assign the result back to df; assigning it to a different variable instead keeps both the original and the subset available.
Example
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

#using select
df1 = select(df,[:student_id,:marks])
println(df1)

println("-------------------------")
#using select!
select!(df, [:name, :age])
println(df)
Explanation
Let’s explain the code provided above.
- Lines 8–9: We use select() to select the student_id and marks columns, assign the new DataFrame to a variable named df1, and then print df1.
- Lines 13–14: We use select!() to select the name and age columns. select!() modifies the original DataFrame, df, in place, so no variable assignment is needed. We then print the modified df.
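To make the difference between the two functions concrete, the sketch below checks the column count of df before and after each call. The names `subset_copy` and `cols_before` are illustrative and not part of the original example.

```julia
using DataFrames

df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])

# select() returns a new DataFrame; columns can also be passed as separate arguments
subset_copy = select(df, :student_id, :marks)

# df itself is untouched at this point
cols_before = ncol(df)
println(cols_before)

# select!() drops every column except the listed ones from df itself
select!(df, [:name, :age])
println(names(df))
```

Because select!() discards the other columns from df, select() is the better choice when the full DataFrame is still needed later.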
Method 4: Using boolean indexing
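We can use Boolean indexing, where we pass a vector of true and false values (lowercase in Julia), to subset columns in a DataFrame. The vector must contain exactly one value per column.
df = df[:,[true,false]]
Assuming df has exactly two columns, the code above keeps the first column and drops the second.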
Example
using DataFrames
df = DataFrame(student_id=[1,2,3,4,5],
               name=["Amy","Jane","John","Nancy","Peter"],
               marks=[50,60,40,47,30],
               age=[15,16,19,18,15])

df = df[:,[true,false,true,true]]
println(df)
Explanation
- Line 7: We use Boolean indexing to select three columns, where true keeps a column and false omits it. Consequently, we keep only the columns student_id, marks, and age. The resulting DataFrame is then assigned back to df.
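In practice, the Boolean vector is often computed rather than typed by hand. The sketch below is one possible way to do this, assuming we want to keep only the numeric columns; `mask` and `numeric_df` are hypothetical names used for illustration.

```julia
using DataFrames

df = DataFrame(student_id = [1, 2, 3, 4, 5],
               name = ["Amy", "Jane", "John", "Nancy", "Peter"],
               marks = [50, 60, 40, 47, 30],
               age = [15, 16, 19, 18, 15])

# Build one Bool per column: true where the column's element type is numeric
mask = [eltype(col) <: Number for col in eachcol(df)]

# Use the mask as the column selector
numeric_df = df[:, mask]

println(mask)
println(names(numeric_df))
```

Here the mask matches the hand-written vector from the example above, since name is the only non-numeric column.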