How to perform data cleaning using Polars
Polars
Before cleaning data with Polars, it helps to know what Polars is. Polars is a DataFrame library written in Rust, with bindings for Python. It focuses on performance and scalability.
Polars vs. other libraries
Polars offers several advantages over other DataFrame tools such as pandas and Apache Spark (used from Python through PySpark).
- Polars is written in Rust and runs many operations in parallel across CPU cores, whereas pandas is largely single-threaded.
- Polars doesn’t use an index for the DataFrame. Rows are addressed by position, which makes the data easier to manipulate.
- Polars represents data using the Apache Arrow columnar memory format, which is efficient in both computation and memory usage.
Data cleaning using Polars
Let’s create a DataFrame filled with random values, and then clean it using Polars.
Creating a DataFrame
This is how to create a DataFrame using Polars:
```python
import polars as pl
import random

# Define the schema with data types for each column
schema = [
    ("int_column", pl.Int32),
    ("float_column", pl.Float32),
    ("bool_column", pl.Boolean),
    ("str_column", pl.Utf8),
]

# Generate random data for each column
data = {
    "int_column": [random.randint(1, 10) for _ in range(10)],
    "float_column": [random.uniform(1.0, 10.0) for _ in range(10)],
    "bool_column": [random.choice([True, False]) for _ in range(10)],
    "str_column": [random.choice(['A', 'B', 'C']) for _ in range(10)],
}

# Create the DataFrame
df = pl.DataFrame(data, schema=schema)

# Display the DataFrame
print(df)
```
Explanation
In the above code:
- Lines 1–2: It imports `polars` to create a DataFrame and `random` to generate random values.
- Lines 5–10: It sets data types for all the columns.
- Lines 13–18: It generates random data into `data`, where:
  - `int_column` contains random integer values between `1` and `10`.
  - `float_column` contains random floating-point values between `1.0` and `10.0`.
  - `bool_column` contains random boolean values.
  - `str_column` contains random strings selected from the set `['A', 'B', 'C']`.
- Line 21: It creates a DataFrame `df`.
- Line 24: It prints the DataFrame `df`.
Cleaning data
There are some basic data-cleaning tasks we can perform on our data using Polars.
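- Normalization: We can rescale numeric columns to a common range (for example, 0 to 1) so that columns with different scales are comparable. Normalization does not remove outliers by itself. Extreme values should be handled separately (for example, by filtering), because they might change the final outcome of our analysis.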
- Remove duplicates: We can remove duplicate rows using `unique()`.
  - If we want to remove duplicates based on specific columns, we can pass them through `subset`, like `df.unique(subset=["column1", "column2"])`. If we want to control which of the duplicate rows is kept, we can specify it using `keep`, like `df.unique(subset=["column1", "column2"], keep='first')`. The `keep` parameter accepts `'first'` and `'last'` to keep the first or last duplicate, `'any'` to keep an arbitrary one, and `'none'` to drop every duplicated row.
- Handle missing values: We can handle missing data (`None` or null values) using the `fill_null()` or `drop_nulls()` method. Floating-point `NaN` values are distinct from nulls in Polars and are replaced with `fill_nan()`.
- Data filtering: We can keep only the rows that satisfy a condition by passing the condition to `filter()`.
- String operations: We can use string methods from the `str` namespace, like `str.contains()`, `str.replace()`, and `str.split()`, to clean string data.
Let’s apply some of these tasks to our dataset.
```python
import polars as pl
import random

# Display the original DataFrame (df was created in the previous section)
print("Original DataFrame:")
print(df)

# Clean the data using the specified functions

# 1. Remove duplicates
df = df.unique()

# 2. Handle missing values by replacing them with 0 for int and float columns
df = df.with_columns(
    pl.col("int_column").fill_null(0),
    pl.col("float_column").fill_null(0.0),
)

# 3. Data filtering: Keep only rows where int_column is greater than 5
df = df.filter(pl.col("int_column") > 5)

# 4. String operations: Convert str_column to lowercase
df = df.with_columns(
    pl.col("str_column").str.to_lowercase()
)

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df)
```
Explanation
In the above code:
- Lines 1–2: It imports `polars` and `random`. Only `polars` is needed here, since `df` was created in the previous snippet.
- Lines 5–6: It prints the original DataFrame using `print()`.
- Line 11: It keeps only unique rows and drops the duplicates in the DataFrame.
- Lines 14–17: It fills the missing values with `0` and `0.0`.
- Line 20: It keeps only the rows where `int_column > 5` and stores the result in our DataFrame.
- Lines 23–25: It converts the string values to lowercase in `str_column`.
- Lines 28–29: It prints the cleaned DataFrame using `print()`.