How to get unique values from an array in Polars
Polars is a versatile data manipulation library for Python, designed for efficient data processing and analysis. One of its useful features is the ability to obtain the unique values of an array. This is particularly helpful when we need to identify and extract the distinct elements of each array in an array column of a Polars DataFrame. In this Answer, we will discuss the Expr.arr.unique() method, which serves this purpose.
The Expr.arr.unique() method
The Expr.arr.unique() method retrieves the unique (distinct) values from each array in an array column of a Polars DataFrame. By invoking this method on a column expression, we obtain a new expression that represents the arrays containing only their unique values.
Syntax
Here’s the syntax of the Expr.arr.unique() method:
Expr.arr.unique(*, maintain_order: bool = False)
The * indicates that any arguments after it must be passed as keyword arguments.
maintain_order is an optional boolean parameter that, if set to True, preserves the order of the unique values in the result. Its default value is False.
Code
Let’s walk through a practical example to understand the usage of the Expr.arr.unique() method:
import polars as pl

# Create a DataFrame with an array column
df = pl.DataFrame({
    "a": [[1, 2, 3, 2]],
}, schema_overrides={"a": pl.Array(width=4, inner=pl.Int64)})

# Use arr.unique() to get unique values from the array
unique_values_expr = df.select(pl.col("a").arr.unique())

# Display the result
print(unique_values_expr)
Explanation
Lines 4–6: We create a DataFrame df with a single column. The column a contains a single row holding the array [1, 2, 3, 2]. The schema_overrides parameter sets the column's data type explicitly. In this case, it declares a as an array of width 4 (i.e., each row holds exactly four elements), where each element is of the Int64 type.
Line 9: The select() method creates a new DataFrame (unique_values_expr) holding the unique values of the a column. The pl.col("a") expression refers to the column a, and .arr.unique() then obtains the unique values in that array.
Line 12: Finally, the unique values DataFrame (unique_values_expr) is printed to the console.
This will display the unique values present in the array column a of the original df DataFrame. Note that this will be a DataFrame with a single column containing the unique values of the array.
Finding unique values in multiple array columns
Let’s add more array columns to the DataFrame and find the unique values in each of them. Here’s how we can do it:
import polars as pl

# Create a DataFrame with multiple array columns
df = pl.DataFrame({
    "a": [[1, 2, 3, 2]],
    "b": [[3, 4, 3, 7]],
    "c": [[8, 12, 13, 12]],
}, schema_overrides={
    "a": pl.Array(width=4, inner=pl.Int64),
    "b": pl.Array(width=4, inner=pl.Int64),
    "c": pl.Array(width=4, inner=pl.Int64),
})

# Use arr.unique() to get unique values from the array columns
unique_values_expr = df.select(pl.col("a", "b", "c").arr.unique())

# Display the result
print(unique_values_expr)
Explanation
Lines 4–12: We create a DataFrame df with multiple columns named a, b, and c. The schema_overrides parameter declares each of them explicitly as an array of width 4 with Int64 elements.
Line 15: We use the select() method to create a new DataFrame (unique_values_expr) holding the unique values from the a, b, and c array columns. The pl.col("a", "b", "c") invocation refers to the columns a, b, and c of the df DataFrame, and .arr.unique() is then applied to each column independently to obtain the unique values of its arrays.
This will display the unique values present in the array columns.
Wrap up
The Expr.arr.unique() method in Polars provides a convenient way to extract unique values from array columns in DataFrames. With its signature, its maintain_order parameter, and the examples above in hand, we can use it to simplify data manipulation and analysis workflows in Polars.