How to use Sklearn to impute missing values

Imputation fills in missing data in a dataset with suitable values, such as a column's mean, median, or mode, or a fixed constant. Missing data can cause problems in machine learning models, leading to biased or inaccurate results. Imputing missing values is therefore an essential data preprocessing step.

In this Answer, we will look into how we can impute missing values so that the dataset remains consistent, improving the robustness of the machine learning models and processes.

The SimpleImputer method

We will use the SimpleImputer class to achieve this task. The illustration below shows the result of applying imputation to columns with missing values.

Applying imputation on missing values

Syntax

To use SimpleImputer, we first need to import it using the following command:

from sklearn.impute import SimpleImputer
Command to import SimpleImputer

Once it is imported, we can create a SimpleImputer object using the following syntax:

SimpleImputer(strategy="", fill_value="", missing_values="", copy=bool, add_indicator=bool, keep_empty_features=bool)
Syntax of SimpleImputer
  • strategy : Defines how the missing values are imputed. By default, its value is "mean". The supported values are:

    • mean: Replaces the missing values with the mean of the column.

    • median: Replaces the missing values with the median of the column.

    • most_frequent: Replaces the missing values with the most frequently occurring value of the column.

    • constant: Replaces the missing values with the constant value defined in the fill_value.

  • fill_value : It is used when strategy is "constant". By default, its value is None. If it is None, the missing data is filled with 0 for numerical data and with the string "missing_value" for string or object data.

  • missing_values : The placeholder that marks an entry as missing. By default, its value is np.nan. It can also be set to pd.NA for pandas DataFrames with nullable integer dtypes. All occurrences of missing_values are imputed.

  • copy : By default, its value is True. If it is True, a copy of the input data is created. Otherwise, the missing values are filled in place whenever possible.

  • add_indicator : By default, its value is False. When set to True, binary missing-value indicator columns are appended to the imputed output, so a predictive model can still account for missingness after imputation. An indicator column is added only for features that had missing values when fit was called.

  • keep_empty_features : By default, its value is False, and columns that contain only missing values when fit is called are dropped from the output. When True, these empty columns are kept and imputed with 0, except when strategy="constant", in which case the value defined in fill_value is used instead.
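
To make the parameters above concrete, here is a minimal sketch that applies each strategy, plus add_indicator, to a small array. The array X and its values are made up for illustration and are not part of the car sales dataset used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing entry.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [3.0, 60.0],
])

# Column 0 has observed values 1, 3, 3: mean 7/3, median 3, most frequent 3.
# Column 1 has observed values 10, 20, 60: mean 30, median 20.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
constant_filled = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X)

# add_indicator=True appends one 0/1 column for each feature that had
# missing values during fit, so the output here has 2 + 2 columns.
with_indicator = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

print("mean:\n", mean_filled)
print("median:\n", median_filled)
print("most_frequent:\n", mode_filled)
print("constant:\n", constant_filled)
print("mean with indicator columns:\n", with_indicator)
```

Each imputer learns its fill values per column during fit, so the same fitted object can later be reused on new data with transform.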

Coding example

We will see a step-by-step procedure to impute the missing values. The steps are elaborated in the sections below:

Import dataset

We will use a car sales dataset that contains missing values. We first load the CSV file and view the number of missing values in each column of the dataset.

import pandas as pd

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())

print("First 5 rows of the dataset\n", car_sales_missing_data.head())
Import dataset

Code explanation

  • Line 1: We import the pandas library to read the CSV file.

  • Line 3: We load the CSV file using the read_csv function.

  • Line 4: We print the number of missing values in each column using .isna().sum().

  • Line 6: We print the first five rows of the dataset using the head function.
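
If car-data.csv is not at hand, the same inspection can be tried on a small in-memory DataFrame. The sketch below is only a stand-in: the column names mirror the ones used later in this Answer, but the name car_sales_sample and every value in it are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for car-data.csv; all values are made up.
car_sales_sample = pd.DataFrame({
    "Company": ["Toyota", "Honda", np.nan, "BMW", "Toyota"],
    "Body Colour": ["White", np.nan, "Blue", "Black", np.nan],
    "Odometer (miles)": [35431.0, np.nan, 84714.0, 154365.0, 181577.0],
    "Doors": [4.0, 4.0, np.nan, 5.0, 4.0],
    "Price": [15323.0, 19943.0, 28343.0, np.nan, 14043.0],
})

# Count of missing entries per column.
missing_per_column = car_sales_sample.isna().sum()
print("Number of missing values in each column:\n", missing_per_column)

# Fraction of missing entries per column, useful for judging how much
# of each column would be replaced by imputed values.
missing_share = car_sales_sample.isna().mean()
print("Share of missing values in each column:\n", missing_share)
```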

Define SimpleImputer

Now that we know the dataset contains missing values, we can define the imputation rule for each column depending on its data type. The code can be seen below:

import pandas as pd
from sklearn.impute import SimpleImputer

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())

print("First 5 rows of the dataset\n", car_sales_missing_data.head())

numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
Define SimpleImputer

Code explanation

  • Line 2: We import SimpleImputer from the sklearn.impute module.

  • Line 9: We define a SimpleImputer with strategy = "mean".

  • Line 10: We define a SimpleImputer with strategy = "constant" and fill_value="missing".

  • Line 11: We define a SimpleImputer with strategy = "constant" and fill_value=4.
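
Defining an imputer does not change any data yet; the fill values are learned and applied only when fit_transform is called. As a rough sketch of what each of the three imputers does on its own, the example below applies them to a tiny DataFrame. The name sample and its values are assumptions made for illustration, not rows from car-data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical three-row sample; the middle row is missing in every column.
sample = pd.DataFrame({
    "Company": ["Toyota", np.nan, "BMW"],
    "Price": [10000.0, np.nan, 20000.0],
    "Doors": [4.0, np.nan, 5.0],
})

numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# SimpleImputer expects 2-D input, so each column is selected with
# double brackets to keep it as a one-column DataFrame.
price_filled = numerical_imputer.fit_transform(sample[["Price"]])
company_filled = categorical_imputer.fit_transform(sample[["Company"]])
doors_filled = door_imputer.fit_transform(sample[["Doors"]])

print("Learned mean for Price:", numerical_imputer.statistics_)
print("Price:\n", price_filled)
print("Company:\n", company_filled)
print("Doors:\n", doors_filled)
```

Calling each imputer separately like this works, but it leaves us to stitch the columns back together by hand, which is the problem ColumnTransformer solves.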

Apply ColumnTransformer

Now we will categorize the columns depending on the type of imputation we want to apply. We will use ColumnTransformer to apply the imputation to the columns. The code is given below:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer", categorical_imputer, categorical_columns),
    ("numerical_imputer", numerical_imputer, numerical_columns),
    ("door_imputer", door_imputer, door_column),
])
filled = transformer.fit_transform(car_sales_missing_data)
car_sales_filled = pd.DataFrame(filled, columns=["Body Colour", "Company", "Odometer (miles)", "Price", "Doors"])
print("After imputing:\n", car_sales_filled.isna().sum())
Apply imputation using ColumnTransformer

Code explanation

  • Line 3: We import ColumnTransformer from sklearn.compose.

  • Lines 11–13: We group the columns into three lists, one for each type of imputation.

  • Lines 15–19: We define the ColumnTransformer and pass in a list of tuples, each containing the name of the transformer, the imputer, and the list of columns to which we want to apply that imputer.

  • Line 20: We apply the ColumnTransformer on the dataset using the fit_transform function.

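  • Line 21: We convert the transformed data into a pandas DataFrame. The output columns follow the order in which the transformers are listed (categorical, numerical, then door), not the column order of the original file, so the column names must be given in that order.

  • Line 22: We print the number of missing values in each column of the imputed dataset, which should now be zero.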
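
The following self-contained sketch repeats the ColumnTransformer step on a small in-memory DataFrame, so it can be run without the CSV file. The sample data is hypothetical. Instead of typing the output column names by hand, it builds them by concatenating the three column lists in the same order as the transformers, which is one simple way to keep the labels aligned with the output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical sample with one missing value in every column.
sample = pd.DataFrame({
    "Company": ["Toyota", np.nan, "BMW", "Honda"],
    "Body Colour": ["White", "Blue", np.nan, "Red"],
    "Odometer (miles)": [30000.0, np.nan, 50000.0, 40000.0],
    "Doors": [4.0, np.nan, 5.0, 4.0],
    "Price": [10000.0, 20000.0, np.nan, 30000.0],
})

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer", SimpleImputer(strategy="constant", fill_value="missing"), categorical_columns),
    ("numerical_imputer", SimpleImputer(strategy="mean"), numerical_columns),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_column),
])

filled = transformer.fit_transform(sample)

# The output columns appear in transformer order, so the names are
# built by joining the column lists in that same order.
output_columns = categorical_columns + numerical_columns + door_column
sample_filled = pd.DataFrame(filled, columns=output_columns)

print(sample_filled)
print("After imputing:\n", sample_filled.isna().sum())
```

Because the output mixes strings and numbers, the resulting columns have the object dtype; numeric columns can be converted back with astype if later steps need numeric dtypes.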

Conclusion

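Imputation improves data completeness and can improve model accuracy by avoiding the data loss that comes from dropping incomplete rows or columns. We can use the SimpleImputer class provided by Sklearn to impute the missing data of a dataset according to a chosen strategy, and ColumnTransformer to apply different strategies to different columns.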

Copyright ©2024 Educative, Inc. All rights reserved