What is the DataFrame.partition_by() method in Polars?
Key takeaways:
DataFrame.partition_by() in Polars splits a DataFrame into separate DataFrames based on column values.
It accepts the parameters by, more_by, maintain_order, include_key, and as_dict, and returns either a list or a dictionary of partitioned DataFrames.
Using maintain_order=False means the order of the resulting partitions is no longer guaranteed to follow the input data.
Setting as_dict=True returns the partitioned DataFrames as a dictionary.
Partitioning by multiple columns creates finer partitions, one per unique combination of values.
The function is useful for data processing, filtering, and parallel tasks on large datasets.
Polars is a DataFrame library written in Rust, inspired by pandas, for efficient and fast data manipulation. DataFrame.partition_by() is a function in the library that splits a DataFrame into separate DataFrames based on the values in one or more columns. Let’s look at the details of the function.
The DataFrame.partition_by() function
This function takes the parameters described below and returns either a list or a dictionary of DataFrames.
Syntax
df.partition_by(by, *more_by, maintain_order=True, include_key=True, as_dict=False)
Parameters
by: This parameter specifies the column name to group the dataset.
more_by: This is an optional argument specifying additional column names to group the dataset.
maintain_order: This is an optional argument ensuring the result is in the same order as the input data. The default bool value is True.
include_key: This is an optional argument specifying whether to include the column(s) used to group by. The default bool value is True.
as_dict: This is an optional argument specifying whether to return the result as a dictionary. The default bool value is False.
Returns
list: A list of DataFrames partitioned by the specified column name.
dict: A dictionary of DataFrames partitioned by the specified column name.
Code
Let’s start by importing the Polars library.
import polars as pl
Next, we can define a simple data frame about the different types of fruits in a supermarket and how ripe they are.
Fruits | Level of Ripeness |
Apples | 1 |
Grapes | 2 |
Bananas | 2 |
Apples | 3 |
Bananas | 1 |
Grapes | 3 |
Make a DataFrame for this table using the Polars library.
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3]
})
Let’s see how df.partition_by() works when we partition the DataFrame by the “Fruits” column.
# Import the polars library as pl
import polars as pl

# Create our DataFrame
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3]
})

# Partition the DataFrame based on "Fruits"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits")
print(partitioned_df)
Explanation
Here is a step-by-step breakdown of the code above.
Import: We import the polars library as pl.
DataFrame creation: We create our DataFrame df using the pl.DataFrame constructor provided by the polars library, passing it a dictionary as input.
Partitioning: The call df.partition_by("Fruits") splits the dataset into three DataFrames, one for each unique value in the "Fruits" column. Each resulting DataFrame therefore contains the rows for a single fruit.
Let’s look at the impact of changing the function parameters.
Using maintain_order = False
In the code below, we partition the dataset by the “Fruits” column and set maintain_order to False.
partitioned_df = df.partition_by("Fruits", maintain_order=False)
With this setting, the order of the resulting DataFrames is no longer guaranteed to match the order in which the “Fruits” values appear in the input, so the partitions may come back in a different order.
Using as_dict = True
In the code below, we partition the dataset by the “Fruits” column and set as_dict to True.
partitioned_df = df.partition_by(["Fruits"], as_dict=True)
If we want to return our DataFrames as a dictionary, we can set the as_dict parameter to True. We pass the column as a list because, in some Polars versions, passing a single column name as a string together with as_dict=True raises a deprecation warning about the format of the dictionary keys.
Partitioning by multiple columns
To do this, let’s first add a third column to our DataFrame. We need it because the code below sets include_key=False: if we partitioned a two-column DataFrame by both of its columns and dropped the key columns, the resulting DataFrames would have no columns left.
Therefore, let’s add another column called “Price in $”, which contains the price of the fruit based on its ripeness level.
# Import the polars library as pl
import polars as pl

# Create our DataFrame
df = pl.DataFrame({
    "Fruits": ["Apples", "Grapes", "Bananas", "Apples", "Bananas", "Grapes"],
    "Level of Ripeness": [1, 2, 2, 3, 1, 3],
    "Price in $": [4, 5, 2, 1, 3, 2]
})

# Partition the DataFrame based on "Fruits" and "Level of Ripeness"
print("Dataset after partitioning: ")
partitioned_df = df.partition_by("Fruits", "Level of Ripeness", include_key=False)
print(partitioned_df)
The number of partitioned DataFrames increases to six because the data is now grouped by each unique combination of the “Fruits” and “Level of Ripeness” columns. Moreover, since we set include_key=False, the result does not include the “Fruits” and “Level of Ripeness” columns.
Conclusion
df.partition_by() is a helpful function for data processing tasks such as filtering, grouping, aggregation, and parallel processing. It is simple to understand, efficient, and works well on large datasets.