In Polars, we can drop duplicates from a DataFrame using the unique() method, which returns the distinct rows of a DataFrame based on one or more columns. We can specify which columns determine uniqueness; if we specify none, all columns are considered. In this Answer, we will explore the unique() method with code examples.
DataFrame.unique(subset, keep, maintain_order)
subset: The column(s) that determine uniqueness. We can pass a single column name as a string, or a list of column names to consider multiple columns. This parameter is optional; if it is omitted, all columns are considered.
keep: Which occurrence to keep when there are duplicates. This parameter is optional and accepts the following options:
    'first': Keeps the first occurrence of each unique row.
    'last': Keeps the last occurrence of each unique row.
    'any': Keeps one occurrence of each unique row, with no guarantee about which one.
    'none': Drops every row that has a duplicate, so only rows that occur exactly once remain.
maintain_order: Whether to keep the rows in the same order as in the original DataFrame. Its value may be True or False. This parameter is optional.
The unique() method returns a new DataFrame that contains only the unique rows, based on the specified columns.
Here is an example of using the unique() method to drop duplicates from a DataFrame:
import polars as pl

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [22, 31, 22, 31, 31],
        'City': ['New York', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']}

df = pl.DataFrame(data)

# Original dataframe
print("Dataframe")
print(df)

# Remove duplicates based on specific column
new_df = df.unique(subset=['Age'])
print("Remove duplicates based on specific column")
print(new_df)

# Remove duplicates on all columns
new_df1 = df.unique()
print("Remove duplicates on all columns")
print(new_df1)
Line 1: We import the required polars library.
Lines 3–8: We create a DataFrame named df, which includes three columns named 'Name', 'Age', and 'City'.
Line 11: We print the original DataFrame df.
Lines 14–16: We remove duplicates based on a specific column. We call the unique() method on df and set the subset parameter to ['Age'], so rows are treated as duplicates whenever they share the same Age value. We store the result in new_df, which contains one row for each distinct value in the Age column.
Lines 19–21: We remove duplicates from the entire DataFrame by calling the unique() method without specifying a subset, so all columns are considered when determining uniqueness. We store the result in new_df1, which contains only the rows that are unique across all columns.