What is DataFrame.groupby in Polars?

Polars is a fast DataFrame library implemented in Rust with bindings for Python, used for manipulating and processing large datasets. It is similar to pandas but optimized for performance and parallel execution, which makes it well suited to big data processing tasks. Polars supports data from various sources, including CSV, Parquet (a columnar storage file format), Arrow (an in-memory columnar format), and more.

Note: We will use Python version 3.6.

We can import the polars library in our Python script or notebook, as shown below:

import polars as pl

We’ll go through the groupby() method of the polars library.

The `groupby()` method

The groupby() method, available in data manipulation libraries like pandas and polars, allows us to group rows of a DataFrame based on the unique values in one or more columns. With the help of the groupby() method, we can group data according to categories and then independently apply functions to the categories.

Here is an example of using the groupby() method:

# importing polars
import polars as pl
data = {
    "id": [1, 2, 3],
    "grade": ["A", "B", "B"]
}
df = pl.DataFrame(data)
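# grouping by the "grade" column and iterating over the groups
# (note: Polars 0.19 and later rename this method to group_by())
for name, group in df.groupby("grade"):
    print(name)
    print(group)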

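In the above code, we iterate through the groups formed by the groupby operation based on the unique values in the grade column. In this case, one group is formed for each of the unique values A and B. The order in which the groups are returned is not guaranteed unless we pass maintain_order=True.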
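To make the grouping step concrete, here is a minimal plain-Python sketch of the same idea. It is not how Polars works internally; the helper name group_rows is hypothetical and only illustrates splitting rows by the unique values of a key column.

```python
from collections import defaultdict

# Same sample data as the Polars example above.
data = {
    "id": [1, 2, 3],
    "grade": ["A", "B", "B"],
}


def group_rows(columns, key):
    """Split column-oriented data into per-key lists of row dicts.

    Hypothetical helper for illustration only; not part of Polars.
    """
    groups = defaultdict(list)
    n_rows = len(columns[key])
    for i in range(n_rows):
        # Rebuild row i from the column lists.
        row = {name: values[i] for name, values in columns.items()}
        groups[row[key]].append(row)
    return dict(groups)


groups = group_rows(data, "grade")
for name, rows in groups.items():
    print(name, rows)
```

Each key of groups corresponds to one group name, and each value holds the rows that belong to that group, which mirrors the (name, group) pairs produced by the loop over df.groupby("grade").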

Operations on the `groupby()` method

There’s a list of operations we can apply to the grouped data. Let’s see the examples of a few of them.

Maximum

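We can find the maximum of the grouped data by calling max() on the result of groupby(). This reduces each group to a single row that holds the maximum value of every other column. Passing maintain_order=True keeps the groups in the order in which they first appear in the data.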

# importing polars
import polars as pl
data = {
    "x": [10, 20, 30, 40, 50, 60],
    "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
    "z": [False, True, False, False, True, True],
    "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]
}
df = pl.DataFrame(data)
# fetching the maximum value
result = df.groupby("w", maintain_order=True).max()
print(result)

Minimum

We can find the minimum of the grouped data by calling min() on the result of groupby(). This reduces each group to a single row that holds the minimum value of every other column.

# importing polars
import polars as pl
data = {
    "x": [10, 20, 30, 40, 50, 60],
    "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
    "z": [False, True, False, False, True, True],
    "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]
}
df = pl.DataFrame(data)
# fetching the minimum value
result = df.groupby("w", maintain_order=True).min()
print(result)

Sum

We can find the sum of the grouped data by calling sum() on the result of groupby(). This reduces each group to a single row that holds the sum of the values in every other column.

# importing polars
import polars as pl
data = {
    "x": [10, 20, 30, 40, 50, 60],
    "y": [0.1, 0.2, 0.5, 1.0, 2.0, 3.0],
    "z": [False, True, False, False, True, True],
    "w": ["Red", "Blue", "Red", "Green", "Green", "Blue"]
}
df = pl.DataFrame(data)
# fetching the sum of the values
result = df.groupby("w", maintain_order=True).sum()
print(result)
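
As a rough cross-check of the three reductions above, the following plain-Python sketch computes the per-group maximum, minimum, and sum of the x column from the same sample data. It is only an illustration of what the reductions mean, not the Polars implementation; the variable names values_by_group and summary are introduced here for the example.

```python
from collections import defaultdict

# The "x" and "w" columns from the examples above.
x = [10, 20, 30, 40, 50, 60]
w = ["Red", "Blue", "Red", "Green", "Green", "Blue"]

# Collect the x values that belong to each colour, in first-seen order.
values_by_group = defaultdict(list)
for colour, value in zip(w, x):
    values_by_group[colour].append(value)

# Reduce every group to one value per statistic.
summary = {
    colour: {"max": max(vals), "min": min(vals), "sum": sum(vals)}
    for colour, vals in values_by_group.items()
}

for colour, stats in summary.items():
    print(colour, stats)
```

For the Red group, for instance, the x values are 10 and 30, so the maximum is 30, the minimum is 10, and the sum is 40.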

Conclusion

We have explored a few examples, but there are many more methods, such as agg, mean, median, tail, and quantile. The groupby() method is a powerful tool that lets us group data efficiently and provides various operations on the grouped data.
