How to get the min and max values of an array in Polars

Polars is a fast DataFrame library implemented in Rust and designed for performance and ease of use. It provides an expressive API for data manipulation and is particularly efficient on large datasets because it runs operations in parallel.

The max and min functions in Polars

In Polars, the Expr.arr.max() and Expr.arr.min() functions compute the maximum and minimum values, respectively, of each subarray in an Array-typed column of a DataFrame. These functions are part of the Polars expression API, which lets us perform operations on DataFrame columns.

Syntax of the max() function

Here’s the syntax of the max() function:

Expr.arr.max()

Components

The max() function takes no arguments. The parts of the call are as follows:

  • Expr: It represents a Polars expression, typically a column in a DataFrame.

  • arr: It is the namespace that exposes operations on columns of the Array data type.

  • max(): It computes the maximum value of each subarray within the column.

The Expr.arr.max() function returns an expression that evaluates to the maximum value of each subarray in the column.

Syntax of the min() function

The syntax of the Expr.arr.min() function is as follows:

Expr.arr.min()

Components

The min() function takes no arguments. The parts of the call are as follows:

  • Expr: It represents a Polars expression, typically a column in a DataFrame.

  • arr: It is the namespace that exposes operations on columns of the Array data type.

  • min(): It computes the minimum value of each subarray within the column.

The Expr.arr.min() function returns an expression that evaluates to the minimum value of each subarray in the column.

These functions aggregate within each subarray of a DataFrame column, producing one maximum or minimum value per row. This supports statistical analysis, feature engineering, and data cleaning tasks. They are particularly useful when each row holds a fixed-length array of values, such as stock prices over time, measurements at different timestamps, or temperature readings at various locations.

Code examples

Let’s consider a simple example where we have a DataFrame with a column named a, and we want to find the maximum values from the subarrays given in column a.

import polars as pl
df = pl.DataFrame(
    data={"a": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2)},
)
Max_val = df.select(pl.col("a").arr.max())
# Printing values
print(Max_val)

Explanation

Let’s discuss the code above step by step:

  • Lines 2–5: We create a DataFrame df using the pl.DataFrame constructor. The DataFrame has one column named a, and the data for a is provided as a list of lists ([[34, 3], [23, 2]]). The schema is explicitly defined with pl.Array(inner=pl.Int64, width=2), which specifies that each value in column a is an array of integers with a width of 2.

  • Line 6: We create a new DataFrame Max_val by selecting the a column from the original DataFrame (pl.col("a")) and then finding the maximum value within each array in that column using the arr.max() function.

  • Line 8: We print the DataFrame Max_val, which contains the maximum value for each array in the a column.

Finding minimum values in arrays

Next, we’ll find the minimum value of each subarray. We use the same DataFrame with a single column named a.

import polars as pl
df = pl.DataFrame(
    data={"a": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2)},
)
Min_val = df.select(pl.col("a").arr.min())
# Printing values
print(Min_val)

Here, line 6 computes the minimum value of each array using the Expr.arr.min() function, and line 8 prints the result.

Finding the maximum values in arrays across multiple columns

Now, we’ll find the maximum value of each subarray in more than one column at once. We have a DataFrame with two columns named a and b.

import polars as pl
df = pl.DataFrame(
    data={"a": [[1, 2], [4, 3]], "b": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2),
            "b": pl.Array(inner=pl.Int64, width=2)},
)
Max_val = df.select(pl.col("a", "b").arr.max())
print(Max_val)

Explanation

Let’s discuss the code above step by step:

  • Lines 2–6: We create a DataFrame df using the pl.DataFrame constructor. The DataFrame has two columns, a and b, and the data for both columns is provided as lists of lists ([[1, 2], [4, 3]] for a and [[34, 3], [23, 2]] for b). The schema is explicitly defined for both columns and specifies that each value in columns a and b is an array of integers with a width of 2.

  • Line 7: We create a new DataFrame Max_val by selecting both the a and b columns from the original DataFrame (pl.col("a", "b")) and then finding the maximum value within each subarray in these columns using the arr.max() function.

  • Line 8: We print the DataFrame Max_val, which contains the maximum value for each subarray in both the a and b columns.

Finding the minimum values in arrays across multiple columns

Now, we’ll find the minimum value of each subarray in more than one column at once. We have a DataFrame with two columns named a and b.

import polars as pl
df = pl.DataFrame(
    data={"a": [[1, 2], [4, 3]], "b": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2),
            "b": pl.Array(inner=pl.Int64, width=2)},
)
Min_val = df.select(pl.col("a", "b").arr.min())
print(Min_val)

The code above is essentially the same as the one in which we found the maximum values from subarrays across multiple columns. Here, line 7 takes the minimum value of each subarray in both columns a and b using the Expr.arr.min() function.

In conclusion, the Expr.arr.min() and Expr.arr.max() functions in Polars let us quickly obtain the range of values within each array in a column. They are particularly useful when working with large datasets where performance is crucial.

Copyright ©2024 Educative, Inc. All rights reserved