How to use Sklearn to impute missing values

Imputation fills in the missing entries of a dataset with suitable values, such as the mean, median, or mode of a column, or a constant. Missing data can bias a machine learning model or make its results inaccurate, and many estimators cannot handle missing values at all. Imputing them is therefore an essential step in data preprocessing.

In this Answer, we will look into how we can impute missing values so that the dataset remains consistent, improving the robustness of the machine learning models and processes.

The `SimpleImputer` class

We will use the SimpleImputer class from the sklearn.impute module to achieve this task. It replaces the missing values in each column either with a statistic computed from that column or with a constant. It accepts the following parameters:

strategy : By default, its value is mean. This defines how the replacement value is computed.
- mean: Replaces the missing values with the mean of the column. It works only with numeric data.
- median: Replaces the missing values with the median of the column. It works only with numeric data.
- most_frequent: Replaces the missing values with the most frequently occurring value of the column. It works with both numeric and string data.
- constant: Replaces the missing values with the constant value defined in fill_value. It works with both numeric and string data.
fill_value : It is used when strategy is constant. By default, its value is None. If it is None, then the missing data will be filled with 0 for numerical data and with the string "missing_value" for string or object data.
missing_values : By default, its value is np.nan. It can also be set to pd.NA, None, or a specific number or string. Every occurrence of missing_values is imputed.
copy : By default, its value is True. If it is True, then a copy of the dataset will be created. Otherwise, the missing values will be filled in place whenever possible.
add_indicator : By default, its value is False. When set to True, the imputer appends binary missing-value indicator columns to the imputed data, so that a predictive model can still account for which values were missing. An indicator column is added only for features that had missing values when the imputer was fitted.
keep_empty_features : By default, its value is False. When True, columns that consist entirely of missing values when the imputer is fitted are kept and imputed with 0, except when strategy="constant", in which case the value defined in fill_value is used instead. When False, such columns are dropped from the output.
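The parameters above can be compared on a small array. The following is a minimal sketch, not part of the car sales example: the array `X` and the fill value `-1` are made-up values chosen only so that each strategy gives a different, easy-to-check result.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing entry.
X = np.array([
    [2.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [5.0, 30.0],
])

# Fit one imputer per statistical strategy and keep the filled arrays.
results = {}
for strategy in ("mean", "median", "most_frequent"):
    results[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# strategy="constant" ignores the column statistics and uses fill_value.
results["constant"] = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# add_indicator=True appends one 0/1 column for each feature that had
# missing values during fit (here both features, so 2 + 2 = 4 columns).
with_indicator = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

for name, filled in results.items():
    print(name, filled.tolist())
print("shape with indicator columns:", with_indicator.shape)
```

In the first column, the observed values are 2, 2, and 5, so mean fills the gap with 3 while median and most_frequent both fill it with 2.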

Coding example

We will see a step-by-step procedure to impute the missing values. The steps are elaborated in the sections below:

Import dataset

We will consider a dataset of car sales that contains missing values. We will first import the CSV file and view the number of missing values in each column of the dataset.
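Counting the missing values can be tried without the CSV file. The sketch below builds a small DataFrame in memory as a stand-in for car-data.csv. The column names follow the ones used later in this Answer, but every value in it is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for car-data.csv; all values are invented.
car_sales = pd.DataFrame({
    "Company": ["Toyota", np.nan, "Honda", "BMW"],
    "Body Colour": ["White", "Blue", np.nan, "Black"],
    "Odometer (miles)": [30000.0, np.nan, 90000.0, 150000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 20000.0, np.nan, 13000.0],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = car_sales.isna().sum()
print("Number of missing values in each column:\n", missing_per_column)

# mean() of the same mask gives the fraction of rows missing per column.
missing_share = car_sales.isna().mean()
print("Share of missing values in each column:\n", missing_share)
```

Looking at the share as well as the count helps when deciding whether a column is worth imputing at all.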

Define `SimpleImputer`

Next, we define one imputer for each kind of column. The line numbers refer to the code listing given in the next section.

Code explanation

Line 2: We import SimpleImputer from the sklearn.impute module.
Line 7: We define a SimpleImputer with strategy="mean".
Line 8: We define a SimpleImputer with strategy="constant" and fill_value="missing".
Line 9: We define a SimpleImputer with strategy="constant" and fill_value=4.
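Defining an imputer does not compute anything yet. The fill values are learned when the imputer is fitted and are reused on any data transformed afterwards. The sketch below illustrates this under stated assumptions: the `train` and `test` arrays are hypothetical prices, and `statistics_` is the fitted attribute in which SimpleImputer stores one fill value per column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data (for example, prices); values are invented.
train = np.array([[10000.0], [20000.0], [np.nan], [30000.0]])
test = np.array([[np.nan], [50000.0]])

numerical_imputer = SimpleImputer(strategy="mean")

# fit() computes the mean of the observed training values: (10000 + 20000 + 30000) / 3.
numerical_imputer.fit(train)
learned_mean = numerical_imputer.statistics_[0]
print("Learned fill value:", learned_mean)

# transform() fills gaps in new data with the value learned from the training data.
test_filled = numerical_imputer.transform(test)
print(test_filled.tolist())
```

Fitting on the training data only and then transforming the test data keeps information from the test set out of the fill values.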

Apply `ColumnTransformer`

Now we will group the columns according to the type of imputation we want to apply. We will use ColumnTransformer to apply each imputer to its group of columns. The code is given below:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
numerical_imputer = SimpleImputer(strategy="mean")  # fill with the column mean
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")  # fill with the string "missing"
door_imputer = SimpleImputer(strategy="constant", fill_value=4)  # fill with 4 doors

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer", categorical_imputer, categorical_columns),
    ("numerical_imputer", numerical_imputer, numerical_columns),
    ("door_imputer", door_imputer, door_column),
])
filled = transformer.fit_transform(car_sales_missing_data)
car_sales_filled = pd.DataFrame(filled, columns=categorical_columns + numerical_columns + door_column)  # output follows the transformer order
print("After imputing:\n", car_sales_filled.isna().sum())

Apply imputation using ColumnTransformer

Code explanation

Line 3: We import ColumnTransformer from sklearn.compose.
Lines 11–13: We group the columns into three lists, one for each imputer.
Lines 15–19: We define the ColumnTransformer and pass in a list of tuples, each containing the name of the transformer, the imputer, and the list of columns on which we want to apply that imputer.
Line 20: We apply the ColumnTransformer to the dataset using the fit_transform function.
Line 21: We convert the transformed array into a pandas DataFrame. The output columns follow the order of the transformers (categorical, numerical, then doors), so the column names must be given in that order.
Line 22: We print the number of missing values in each column of the dataset, which will be zero for every column.
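The same steps can be run end to end without the CSV file. The sketch below is an illustration under assumptions, not the original listing: the DataFrame `df` is a hypothetical stand-in for car-data.csv with invented values. It also shows two details that are easy to miss: the output columns follow the order of the transformers rather than the order of the original frame, and the combined array holds both strings and numbers, so the numeric columns are converted back to floats at the end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for car-data.csv; all values are invented.
df = pd.DataFrame({
    "Company": ["Toyota", np.nan, "Honda", "BMW"],
    "Body Colour": ["White", "Blue", np.nan, "Black"],
    "Odometer (miles)": [30000.0, np.nan, 90000.0, 150000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 20000.0, np.nan, 13000.0],
})

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer", SimpleImputer(strategy="constant", fill_value="missing"), categorical_columns),
    ("numerical_imputer", SimpleImputer(strategy="mean"), numerical_columns),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_column),
])

# The result is a NumPy array whose columns follow the transformer order.
filled = transformer.fit_transform(df)

# Name the columns in that same order: categorical, numerical, then doors.
output_columns = categorical_columns + numerical_columns + door_column
filled_df = pd.DataFrame(filled, columns=output_columns)

# Convert the numeric columns back to floats after the mixed-type stacking.
numeric = numerical_columns + door_column
filled_df = filled_df.astype({column: float for column in numeric})

print(filled_df)
print("After imputing:\n", filled_df.isna().sum())
```

Any column that is not listed in one of the three groups is dropped from the output, because ColumnTransformer uses remainder="drop" by default.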

Conclusion

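Imputation lets us keep rows that would otherwise have to be dropped, which improves data completeness and can improve model accuracy. We can use the SimpleImputer class provided by Sklearn to impute the missing data of a dataset according to a chosen strategy, and ColumnTransformer to apply a different strategy to each group of columns.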
