How to drop duplicates from a DataFrame in Polars

Parameters

subset: This parameter specifies the column(s) to consider when identifying duplicates. We can pass a single column name as a string or a list of column names. This parameter is optional; if it is omitted, all columns are considered.
keep: This parameter determines which row to keep from each group of duplicates. It is optional and, in recent Polars versions, defaults to 'any'. It accepts the following options:
- 'first': Keeps the first occurrence of each duplicated row
- 'last': Keeps the last occurrence of each duplicated row
- 'any': Keeps one row from each group of duplicates, with no guarantee about which one
- 'none': Drops every row that has a duplicate, so only rows that occur exactly once remain
maintain_order: This Boolean parameter (True or False) controls whether the result keeps the row order of the original DataFrame. It is optional and, in recent Polars versions, defaults to False. Setting it to True is more expensive to compute.

Return value

The unique() method returns a new DataFrame containing only the rows that are unique with respect to the specified columns. The original DataFrame is not modified.

Example

Here is an example of using the unique() method to drop duplicates from the DataFrame:

import polars as pl
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'Age': [22, 31, 22, 31, 31],
    'City': ['New York', 'San Francisco', 'New York', 'Los Angeles', 'San Francisco']
}
df = pl.DataFrame(data)
# Original dataframe
print("Dataframe")
print(df)
# Remove duplicates based on specific column
new_df = df.unique(subset=['Age'])
print("Remove duplicates based on specific column")
print(new_df)
# Remove duplicates on all columns
new_df1 = df.unique()
print("Remove duplicates on all columns")
print(new_df1)

Explanation

Line 1: We import the required polars library.
Lines 2–7: We create the DataFrame named df, which includes three columns named 'Name', 'Age', and 'City'.
Lines 9–10: We print the original DataFrame df.
Lines 11–14: We remove duplicates based on a specific column. We apply the unique() method to the DataFrame df and set the subset parameter to ['Age'], so rows are compared on the Age column only. We store the result in new_df, which contains one row per distinct Age value (two rows here, for 22 and 31). Because keep is not specified, which row represents each age is not guaranteed.
Lines 15–18: We remove duplicates from the entire DataFrame by calling the unique() method without specifying a subset, so all columns are considered when determining uniqueness. The result is stored in new_df1, which contains three rows, one each for Alice, Bob, and Charlie. The row order of the output is not guaranteed unless maintain_order is set to True.
