How to drop duplicates from a DataFrame in Polars

In Polars, we can drop duplicate rows from a DataFrame with the unique() method. It compares rows on all columns, or on a subset of columns that we specify, and retains only the distinct rows. In this Answer, we will explore the unique() method with code examples.

Syntax

Here is the syntax of the unique() method:

DataFrame.unique(subset=None, *, keep='any', maintain_order=False)

Parameters

  • subset (optional): This parameter specifies the column(s) to consider when identifying duplicates. We can pass a single column name as a string or a list of column names. If we omit it, all columns are used.

  • keep (optional): This parameter determines which row to keep from each group of duplicate rows. It accepts the following options:

    • 'first': Keeps the first occurrence in each group of duplicates

    • 'last': Keeps the last occurrence in each group of duplicates

    • 'any': Keeps one row from each group of duplicates, with no guarantee about which one

    • 'none': Drops every row that has a duplicate, so only rows that occur exactly once remain

  • maintain_order (optional): If set to True, the result keeps the row order of the original DataFrame, which is more expensive to compute. If set to False, the order of the returned rows is not guaranteed.

Return value

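The unique() method returns a new DataFrame that contains only the rows that are unique with respect to the specified columns. All columns are retained in the result, and the original DataFrame is not modified.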

Example

Here is an example of using the unique() method to drop duplicates from the DataFrame:

import polars as pl

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [22, 31, 22, 31, 31],
    'City': ['New York', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
}
df = pl.DataFrame(data)

# Original DataFrame
print("Dataframe")
print(df)

# Remove duplicates based on a specific column
new_df = df.unique(subset=['Age'])
print("Remove duplicates based on specific column")
print(new_df)

# Remove duplicates on all columns
new_df1 = df.unique()
print("Remove duplicates on all columns")
print(new_df1)

Explanation

  • Line 1: We import the required polars library.

  • Lines 3–8: We create the DataFrame named df, which includes three columns named 'Name', 'Age', and 'City'.

  • Lines 10–12: We print the original DataFrame df.

  • Lines 14–17: We remove duplicates based on a specific column. We apply the unique() method to the DataFrame df and set the subset parameter to ['Age'], indicating that rows with the same Age value count as duplicates. We store the result in new_df, which contains one row for each distinct Age value. Because we don't set keep or maintain_order, which of the duplicate rows is kept and the order of the returned rows are not guaranteed.

  • Lines 19–22: We remove duplicates from the entire DataFrame by calling the unique() method without specifying a subset, so all columns are considered when determining uniqueness. The resulting DataFrame is stored in new_df1, which contains only the rows that are unique across all columns.

Copyright ©2024 Educative, Inc. All rights reserved