The with_columns function in Polars provides a convenient way to add new columns to a DataFrame without copying the existing data. It's part of the DataFrame API, which is designed to manipulate and transform tabular data efficiently. The function is useful for extending a DataFrame with calculated or derived columns.
with_columns
The with_columns function is defined as follows:
DataFrame.with_columns(*exprs: IntoExpr | Iterable[IntoExpr], **named_exprs: IntoExpr)
exprs: The exprs parameter specifies the columns to add, passed as positional arguments. It accepts expression input: strings are interpreted as column names, and other non-expression inputs are interpreted as literals.
named_exprs: The named_exprs parameter specifies additional columns to add, passed as keyword arguments. Each resulting column is named after its keyword.
The function returns a new DataFrame with the specified columns added.
with_columns
First, let’s look at a simple example of using the with_columns function:
```python
import polars as pl

# Creating DataFrame
data = pl.DataFrame({
    "alpha": [4, 6, 8, 10],
    "beta": [5.0, 4.8, 10.2, 20.0],
    "gamma": [True, False, False, True],
})

# Using .with_columns to add a new column
new_dataFrame = data.with_columns((pl.col("alpha") ** 3).alias("alpha^3"))

# Printing the values
print(new_dataFrame)
```
Let’s discuss the code step by step:
Lines 3–9: We create a DataFrame named data with three columns: alpha, beta, and gamma. The alpha column contains integer values, the beta column contains float values, and the gamma column contains boolean values.
Line 11: We use the with_columns function to add a new column, alpha^3, that holds the cube of each value in the alpha column.
Line 14: We print the DataFrame with the newly added column.
Second, let’s look at a more complex example that adds multiple columns using .with_columns.
```python
import polars as pl

# Creating DataFrame
data = pl.DataFrame({
    "alpha": [4, 6, 8, 10],
    "beta": [5.0, 4.8, 10.2, 20.0],
    "gamma": [True, False, False, True],
})

# Adding multiple columns
new_dataFrame = data.with_columns(
    [
        (pl.col("alpha") ** 3).alias("alpha^3"),
        (pl.col("beta") * 3).alias("beta*3"),
        (pl.col("gamma").not_()).alias("not gamma"),
    ]
)

# Printing the values
print(new_dataFrame)
```
Let’s discuss the code step by step:
Lines 3–9: We create a DataFrame named data with three columns: alpha, beta, and gamma. The alpha column contains integer values, the beta column contains float values, and the gamma column contains boolean values.
Lines 11–17: We add three new columns: the cube of the alpha column, the beta column multiplied by 3, and the logical negation of the gamma column. The result is assigned to the variable new_dataFrame.
Line 20: We print the DataFrame with the added columns.
Finally, let’s explore expressions with multiple outputs. These can be automatically turned into Structs by enabling the setting Config.set_auto_structify(True):
```python
import polars as pl

# Creating DataFrame
data = pl.DataFrame({
    "alpha": [4, 6, 8, 10],
    "beta": [5.0, 4.8, 10.2, 20.0],
    "gamma": [True, False, False, True],
})

# Adding a struct column with auto_structify enabled
with pl.Config(auto_structify=True):
    new_dataFrame = data.drop("gamma").with_columns(
        diffs=pl.col(["alpha", "beta"]).diff().name.suffix("_diff"),
    )

# Printing the values
print(new_dataFrame)
```
Let’s discuss the code step by step:
Lines 3–9: We create a DataFrame named data with three columns: alpha, beta, and gamma. The alpha column contains integer values, the beta column contains float values, and the gamma column contains boolean values.
Lines 11–14: We use a with block to set the Polars configuration option auto_structify to True. Inside the block, we create a new DataFrame named new_dataFrame by first dropping the gamma column and then adding a new column named diffs. The expression applies diff() to both alpha and beta, which computes the difference between each value and the previous value in the same column, and suffixes the two outputs with _diff. Because the expression has multiple outputs, they are combined into a single struct column named diffs.
Line 17: We print the DataFrame with the newly added column.
The with_columns function in Polars is a powerful tool for extending DataFrames with new columns. It provides a flexible and concise syntax for expressing column additions, and it's particularly useful when we want to enrich the data with calculated or derived values. Remember that this approach doesn’t duplicate the existing data, which makes it an efficient way to extend a DataFrame.