Before cleaning data with Polars, we need to understand what Polars is. Polars is a Rust-based DataFrame library that focuses on performance and scalability.
Polars offers several features that can give it an edge over other DataFrame tools such as pandas and Apache Spark:
Polars is written in Rust and runs many operations in parallel across CPU cores, whereas pandas executes most operations on a single thread.
Polars doesn’t use an index for the data frame, which makes it easier to manipulate the data.
Polars represents data using the Apache Arrow columnar memory format, which is efficient in both computation and memory usage.
Let’s create a small DataFrame filled with random values that we can then clean using Polars.
This is how to create a DataFrame using Polars:
import polars as pl
import random

# Define the schema with data types for each column
schema = [
    ("int_column", pl.Int32),
    ("float_column", pl.Float32),
    ("bool_column", pl.Boolean),
    ("str_column", pl.Utf8),
]

# Generate random data for each column
data = {
    "int_column": [random.randint(1, 10) for _ in range(10)],
    "float_column": [random.uniform(1.0, 10.0) for _ in range(10)],
    "bool_column": [random.choice([True, False]) for _ in range(10)],
    "str_column": [random.choice(['A', 'B', 'C']) for _ in range(10)],
}

# Create the DataFrame
df = pl.DataFrame(data, schema=schema)

# Display the DataFrame
print(df)
In the above code:
Lines 1–2: It imports random to generate random values and polars to create a DataFrame.
Lines 5–10: It sets data types for all the columns.
Lines 13–18: It generates random data into data, where:
int_column contains random integer values between 1 and 10.
float_column contains random floating-point values between 1.0 and 10.0.
bool_column contains random boolean values.
str_column contains random strings selected from the set ['A', 'B', 'C'].
Line 21: It creates a data frame df.
Line 24: It prints the data frame df.
There are some basic data-cleaning tasks we can perform on our data using Polars.
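Normalization: We can normalize numeric columns by rescaling them to a common range, such as 0 to 1, so that columns with large values do not dominate the analysis. Normalization alone does not remove outliers; extreme values that might change the final outcome of our analysis still need to be handled separately, for example by filtering.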
Remove duplicates: We can remove duplicate rows using unique(), for example df.unique(subset=["column1", "column2"]). To control which of the duplicate rows is kept, we can pass keep, as in df.unique(subset=["column1", "column2"], keep='first'). Passing 'first' or 'last' keeps the first or last occurrence of each duplicate.
Handle missing values: We can handle missing data (None or null values) using the fill_null() or drop_nulls() method.
Data filtering: We can filter data by passing conditions to filter().
String operations: We can use string methods such as str.contains(), str.replace(), and str.split() to clean string data.
Let’s apply some of these operations to the df DataFrame we created above.
import polars as pl
import random

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Clean the data using the specified functions

# 1. Remove duplicates
df = df.unique()

# 2. Handle missing values by replacing them with 0 for int and float columns
df = df.with_columns(
    pl.when(df['int_column'].is_null()).then(0).otherwise(df['int_column']).alias("int_column"),
    pl.when(df['float_column'].is_null()).then(0.0).otherwise(df['float_column']).alias("float_column")
)

# 3. Data filtering: Keep only rows where int_column is greater than 5
df = df.filter(pl.col('int_column') > 5)

# 4. String operations: Convert str_column to lowercase (the column keeps its name)
df = df.with_columns(
    df['str_column'].str.to_lowercase().alias("str_column")
)

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df)
In the above code:
Lines 1–2: It imports random to generate random values and polars to create a DataFrame.
Lines 5–6: It prints the original DataFrame using the print() function.
Line 11: It keeps only unique rows and drops the duplicates in the DataFrame.
Lines 14–17: It fills the missing values with 0 in int_column and 0.0 in float_column.
Line 20: It keeps only the rows where int_column > 5 and stores the result in our DataFrame.
Lines 23–25: It converts the string values in str_column to lowercase.
Lines 28–29: It prints the cleaned DataFrame using the print() function.