How to get the min and max values of an array in Polars
Polars is a fast DataFrame library implemented in Rust and designed for performance and ease of use. It provides an expression-based API for data manipulation and is particularly efficient on large datasets because it runs queries in parallel.
The max and min functions in Polars
In Polars, the Expr.arr.max() and Expr.arr.min() functions compute the maximum and minimum values, respectively, of the subarrays within an Array-typed column of a DataFrame. They are part of the Polars expression API, which lets us perform operations on DataFrame columns.
Syntax of the max() function
Here’s the syntax of the max() function:
Expr.arr.max()
Parameters
The function takes no arguments. The parts of the call are:
Expr: It represents a Polars expression, typically a column in a DataFrame.
arr: It refers to the namespace of operations for the array type.
max(): It computes the maximum value of each subarray within the column.
The Expr.arr.max() function returns an expression that holds the maximum value of each subarray, one value per row.
Syntax of the min() function
The syntax of the Expr.arr.min() function is as follows:
Expr.arr.min()
Parameters
The function takes no arguments. The parts of the call are:
Expr: It represents a Polars expression, typically a column in a DataFrame.
arr: It refers to the namespace of operations for the array type.
min(): It computes the minimum value of each subarray within the column.
The Expr.arr.min() function returns an expression that holds the minimum value of each subarray, one value per row.
These functions are useful whenever each row of a column holds a fixed-length array of values, such as stock prices over a time window, measurements at successive timestamps, or temperature readings at several locations. Computing the minimum and maximum of each subarray gives a quick per-row summary that supports statistical analysis, feature engineering, and data cleaning.
Code examples
Let’s consider a simple example where we have a DataFrame with a column named a, and we want to find the maximum values from the subarrays given in column a.
import polars as pl

df = pl.DataFrame(
    data={"a": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2)},
)
Max_val = df.select(pl.col("a").arr.max())
# Printing values
print(Max_val)
Explanation
Let’s discuss the code above step by step:
Lines 3–6: We create a DataFrame df using the pl.DataFrame constructor. The DataFrame has one column named a, and the data for a is provided as a list of lists ([[34, 3], [23, 2]]). The schema is explicitly defined with pl.Array(inner=pl.Int64, width=2), which specifies that column a holds arrays of integers with a width of 2.
Line 7: We create a new DataFrame Max_val by selecting column a from the original DataFrame (pl.col("a")) and then finding the maximum value within each array in that column using the arr.max() function.
Line 9: We print the DataFrame Max_val, which contains the maximum value for each array in column a.
Finding minimum values in arrays
Next, we find the minimum value of each subarray, using the same DataFrame with a single column named a.
import polars as pl

df = pl.DataFrame(
    data={"a": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2)},
)
Min_val = df.select(pl.col("a").arr.min())
# Printing values
print(Min_val)
Here, line 7 computes the minimum value of each array using the Expr.arr.min() function, and line 9 prints the result.
Finding the maximum values in arrays across multiple columns
Next, we find the maximum values of the subarrays in several columns at once. We have a DataFrame with two columns named a and b.
import polars as pl
df = pl.DataFrame(
    data={"a": [[1, 2], [4, 3]], "b": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2),
            "b": pl.Array(inner=pl.Int64, width=2)},
)

Max_val = df.select(pl.col("a", "b").arr.max())
print(Max_val)
Explanation
Let’s discuss the code above step by step:
Lines 2–6: We create a DataFrame df using the pl.DataFrame constructor. The DataFrame has two columns, a and b, and the data for both columns is provided as lists of lists ([[1, 2], [4, 3]] for a and [[34, 3], [23, 2]] for b). The schema is explicitly defined for both columns and specifies that each of them holds arrays of integers with a width of 2.
Line 8: We create a new DataFrame Max_val by selecting both columns a and b from the original DataFrame (pl.col("a", "b")) and then finding the maximum value within each subarray in these columns using the arr.max() function.
Line 9: We print the DataFrame Max_val, which contains the maximum value for each subarray in both columns a and b.
Finding the minimum values in arrays across multiple columns
Next, we find the minimum values of the subarrays in several columns at once. We have a DataFrame with two columns named a and b.
import polars as pl
df = pl.DataFrame(
    data={"a": [[1, 2], [4, 3]], "b": [[34, 3], [23, 2]]},
    schema={"a": pl.Array(inner=pl.Int64, width=2),
            "b": pl.Array(inner=pl.Int64, width=2)},
)

Min_val = df.select(pl.col("a", "b").arr.min())
print(Min_val)
The code above is essentially the same as the one in which we found the maximum values from subarrays across multiple columns. Here, line 8 takes the minimum values from both columns a and b using the Expr.arr.min() function.
In conclusion, the Expr.arr.min() and Expr.arr.max() functions in Polars give us a quick view of the range of values within each subarray of a column. They are particularly useful when working with large datasets where performance is crucial.