How to perform data cleaning using Polars

Polars

Before cleaning data with Polars, it helps to understand what Polars is. Polars is a DataFrame library written in Rust, with a Python API, that focuses on performance and scalability.

Polars vs. other libraries

Polars has several design features that set it apart from other DataFrame tools such as pandas and Apache Spark:

  • Polars is written in Rust and runs many operations in parallel across CPU cores, whereas pandas is largely single-threaded.

  • Polars doesn’t use an index for the data frame, which makes it easier to manipulate the data.

  • Polars stores data in the Apache Arrow columnar memory format, which is efficient in both computation and memory usage.

Data cleaning using Polars

Let’s create a DataFrame of 10 rows and 4 columns filled with random values, and then clean it using Polars.

Creating a DataFrame

This is how to create a DataFrame using Polars:

import polars as pl
import random

# Define the schema with data types for each column
schema = [
    ("int_column", pl.Int32),
    ("float_column", pl.Float32),
    ("bool_column", pl.Boolean),
    ("str_column", pl.Utf8),
]

# Generate random data for each column
data = {
    "int_column": [random.randint(1, 10) for _ in range(10)],
    "float_column": [random.uniform(1.0, 10.0) for _ in range(10)],
    "bool_column": [random.choice([True, False]) for _ in range(10)],
    "str_column": [random.choice(['A', 'B', 'C']) for _ in range(10)],
}

# Create the DataFrame
df = pl.DataFrame(data, schema=schema)

# Display the DataFrame
print(df)

Explanation

In the above code:

  • Lines 1–2: It imports random to generate random values and polars to create a DataFrame.

  • Lines 5–10: It sets data types for all the columns.

  • Lines 13–18: It generates random values and stores them in the data dictionary, where:

    • int_column contains random integer values between 1 and 10.
    • float_column contains random floating-point values between 1.0 and 10.0.
    • bool_column contains random boolean values.
    • str_column contains random strings selected from the set ['A', 'B', 'C'].
  • Line 21: It creates a data frame df.

  • Line 24: It prints the data frame df.

Cleaning data

There are some basic data-cleaning tasks we can perform on our data using Polars.

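  • Normalization: We can rescale numeric columns to a common range (for example, 0 to 1) so that columns with large values don't dominate the analysis. Normalization by itself doesn't remove outliers, which can change the final outcome of our analysis; they need to be handled separately, for example with filter().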

  • Remove duplicates: We can remove duplicates from data using unique().

    • To remove duplicates based on specific columns, we can pass subset, as in df.unique(subset=["column1", "column2"]). To choose which duplicate to keep, we can pass keep, as in df.unique(subset=["column1", "column2"], keep='first'). Passing 'first' or 'last' keeps the first or last occurrence, respectively.
  • Handle missing values: We can handle missing data (null values, written as None in Python) using the fill_null() or drop_nulls() method.

  • Data filtering: We can keep only the rows that satisfy a condition by passing that condition to filter().

  • String operations: We can use the methods in the str namespace, like str.contains(), str.replace(), and str.split(), to clean string data.

Let’s apply some of these operations to our dataset.

import polars as pl
import random

# Display the original DataFrame (df was created in the previous snippet)
print("Original DataFrame:")
print(df)

# Clean the data using the specified functions

# 1. Remove duplicates
df = df.unique()

# 2. Handle missing values by replacing them with 0 for int and float columns
df = df.with_columns(
    pl.col("int_column").fill_null(0),
    pl.col("float_column").fill_null(0.0),
)

# 3. Data filtering: Keep only rows where int_column is greater than 5
df = df.filter(pl.col('int_column') > 5)

# 4. String operations: Convert str_column to lowercase
df = df.with_columns(
    pl.col("str_column").str.to_lowercase()
)

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df)

Explanation

In the above code:

  • Lines 1–2: It imports polars and random, as in the previous snippet. The code reuses the DataFrame df created there.

  • Lines 5–6: It prints the original DataFrame using the print() function.

  • Line 11: It keeps only unique rows and drops the duplicates in the DataFrame.

  • Lines 14–17: It replaces missing values with 0 in int_column and 0.0 in float_column.

  • Line 20: It keeps only the rows where int_column > 5 and stores the result in our DataFrame.

  • Lines 23–25: It converts the string values to lowercase in str_column.

  • Lines 28–29: It prints the cleaned DataFrame using the print() function.

Copyright ©2024 Educative, Inc. All rights reserved