How to drop duplicates from a DataFrame in Polars
In Polars, we can drop duplicate rows from a DataFrame using the unique() method, which retains only the distinct rows. By default, it compares entire rows, but we can restrict the comparison to one or more columns. In this Answer, we will explore the unique() method with code examples.
Syntax
DataFrame.unique(subset=None, *, keep='any', maintain_order=False)
Parameters
subset: This parameter specifies the column(s) used to decide whether rows are duplicates. We can pass a single column name as a string or a list of column names if we want to consider multiple columns when determining uniqueness. If it is omitted, all columns are considered (optional).
keep: This parameter determines which occurrence to keep when there are duplicates (optional). It accepts the following options:
    'first': Keeps the first occurrence of each unique row.
    'last': Keeps the last occurrence of each unique row.
    'any': Keeps one occurrence, with no promise about which one.
    'none': Drops every row that has a duplicate, so only rows that occur exactly once remain.
maintain_order: This parameter controls whether the result keeps the same row order as the original DataFrame. Its value may be True or False (optional).
Return value
The unique() method returns a new DataFrame with unique values based on specified columns.
Example
Here is an example of using the unique() method to drop duplicates from the DataFrame:
import polars as pl

data = {'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
        'Age': [22, 31, 22, 31, 31],
        'City': ['New York', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']}

df = pl.DataFrame(data)

# Original dataframe
print("Dataframe")
print(df)

# Remove duplicates based on specific column
new_df = df.unique(subset=['Age'])
print("Remove duplicates based on specific column")
print(new_df)

# Remove duplicates on all columns
new_df1 = df.unique()
print("Remove duplicates on all columns")
print(new_df1)
Explanation
Line 1: We import the required polars library.
Lines 3–8: We create the DataFrame named df, which includes three columns named 'Name', 'Age', and 'City'.
Line 11: We print the original DataFrame df.
Lines 14–16: We remove duplicates based on a specific column. We apply the unique() method to the DataFrame df and set the subset parameter to ['Age'], indicating that duplicates should be removed based on the Age column. We store the results in the new_df DataFrame, which only contains unique rows based on the Age column.
Lines 19–21: Here, we remove duplicates from the entire DataFrame by calling the unique() method without specifying a subset. This means that it considers all columns for determining uniqueness. The resulting DataFrame is stored in new_df1, which contains only the unique rows based on all columns.