What is DataFrame.groupby in Polars?
Polars is a fast DataFrame library implemented in Rust with bindings for Python. It is used to manipulate and process large datasets. It is similar to pandas but optimized for performance and parallel execution, making it well suited to big data tasks. Polars supports data from various sources, including CSV,
Note: We will use Python version 3.6.
We can import the polars library in our Python script or notebook, as shown below:
import polars as pl
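We’ll go through the groupby() method of the polars library. Note that Polars 0.19 and later name this method group_by(); groupby() is the earlier spelling, and it is the one used in the examples here.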
The groupby() method
The groupby() method, available in data manipulation libraries like pandas and polars, groups the rows of a DataFrame by the unique values in one or more columns. Once the rows are grouped, we can apply a function to each group independently, for example to compute one summary value per category.
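To make the idea concrete, here is a minimal sketch of the same group-then-apply pattern in plain Python, without Polars. The rows and the names `rows`, `groups`, and `counts` are hypothetical and chosen only for illustration; this is not how Polars implements grouping internally.

```python
from collections import defaultdict

# Hypothetical rows: a small table of ids and grades.
rows = [
    {"id": 1, "grade": "A"},
    {"id": 2, "grade": "B"},
    {"id": 3, "grade": "B"},
]

# Split: bucket the rows by the value of the grouping column.
groups = defaultdict(list)
for row in rows:
    groups[row["grade"]].append(row)

# Apply and combine: compute one value per group (here, a row count).
counts = {grade: len(members) for grade, members in groups.items()}
print(counts)
```

Each distinct grade becomes one bucket, and the function applied afterwards sees only the rows in its own bucket.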
Here is the syntax for using the groupby() method:
# importing polars
import polars as pl
data = {"id": [1, 2, 3],
        "grade": ["A", "B", "B"]}
df = pl.DataFrame(data)
# grouping w.r.t column
for name, data in df.groupby("grade"):
    print(name)
    print(data)
In the above code, we iterate through groups formed by the groupby operation based on the unique values in the grade column. In this case, the groups are formed for unique values A and B.
Operations on the groupby() method
There’s a list of operations we can apply to the grouped data. Let’s see the examples of a few of them.
Maximum
We can find the maximum within each group by calling max() on the grouped data. This reduces every group to a single row that holds the maximum of each remaining column. Passing maintain_order=True keeps the groups in the order in which they first appear.
# importing polars
import polars as pl
data = {"x": [10, 20, 30, 40, 50, 60],
        "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
        "z": [False, True, False, False, True, True],
        "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]}
df = pl.DataFrame(data)
# fetching the maximum value
result = df.groupby("w", maintain_order=True).max()
print(result)
Minimum
We can find the minimum within each group by calling min() on the grouped data. This reduces every group to a single row that holds the minimum of each remaining column.
# importing polars
import polars as pl
data = {"x": [10, 20, 30, 40, 50, 60],
        "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
        "z": [False, True, False, False, True, True],
        "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]}
df = pl.DataFrame(data)
# fetching the minimum value
result = df.groupby("w", maintain_order=True).min()
print(result)
Sum
We can find the sum within each group by calling sum() on the grouped data. This reduces every group to a single row that holds the sum of each remaining column.
# importing polars
import polars as pl
data = {"x": [10, 20, 30, 40, 50, 60],
        "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
        "z": [False, True, False, False, True, True],
        "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]}
df = pl.DataFrame(data)
# fetching the sum of the values
result = df.groupby("w", maintain_order=True).sum()
print(result)
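As a cross-check on the maximum, minimum, and sum examples, the sketch below recomputes the three reductions for the x column by hand in plain Python, grouping by the w column. It is only an illustration of what each reduction means per group; the names `by_color` and `summary` are hypothetical, and the output layout differs from the table that Polars prints.

```python
from collections import defaultdict

# The x and w columns from the examples above.
x = [10, 20, 30, 40, 50, 60]
w = ["Red", "Blue", "Red", "Green", "Green", "Blue"]

# Collect the x values that belong to each color.
by_color = defaultdict(list)
for color, value in zip(w, x):
    by_color[color].append(value)

# Reduce each group to its maximum, minimum, and sum.
summary = {
    color: {"max": max(values), "min": min(values), "sum": sum(values)}
    for color, values in by_color.items()
}
for color, stats in summary.items():
    print(color, stats)
```

For instance, the Red group contains the x values 10 and 30, so its maximum is 30, its minimum is 10, and its sum is 40.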
Conclusion
We have explored a few examples, but there are many more methods, such as agg, mean, median, tail, and quantile. The DataFrame.groupby method is a powerful tool that lets us group data efficiently and provides various operations on the grouped data.