Imputation fills in missing data in a dataset with suitable values, such as the column's mean, median, mode, or a constant. Missing data can cause issues in machine learning models, leading to biased or inaccurate results, so imputing it is an essential step in data pre-processing.
In this Answer, we will look into how we can impute missing values so that the dataset remains consistent, improving the robustness of the machine learning models and processes.
The SimpleImputer class
We will use scikit-learn's SimpleImputer class to achieve this task. In the illustration below, we can see the result of applying imputation on the columns with missing values.
To use SimpleImputer, we first need to import it using the following command:
from sklearn.impute import SimpleImputer
Once the import is done, we can create a SimpleImputer object using the following syntax:
SimpleImputer(strategy = "" , fill_value = "" , missing_values = "" , copy = bool , add_indicator = bool , keep_empty_features = bool)
strategy: By default, its value is mean. This defines the rule used to compute the imputed values.
  mean: Replaces the missing values with the mean of the column (numeric columns only).
  median: Replaces the missing values with the median of the column (numeric columns only).
  most_frequent: Replaces the missing values with the most frequently occurring value of the column.
  constant: Replaces the missing values with the constant value defined in fill_value.
fill_value: It is used when strategy is constant. By default, its value is None. If it is None, the missing data is filled with 0 for numerical data and with "missing_value" for string or object data.
missing_values: By default, its value is np.nan. It can also be set to pd.NA, for example, for pandas DataFrames with nullable integer dtypes. Every occurrence of missing_values in the data is imputed.
copy: By default, its value is True. If it is True, a copy of the dataset is created. Otherwise, the missing values are filled in place whenever possible.
add_indicator: By default, its value is False. When set to True, the imputer appends missing value indicator columns to the imputed data, allowing predictive models to account for missingness despite imputation. A feature that had no missing values during fitting does not get an indicator column, even if it has missing values at transform time.
keep_empty_features: By default, its value is False. When True, columns that consist entirely of missing values during fitting are kept in the output. They are imputed with 0, except when strategy="constant", in which case the value defined in fill_value is used instead.
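To make the parameters above concrete, here is a minimal sketch on a small made-up column (the numbers are illustrative and not taken from the car sales dataset). It shows what each strategy writes into the same gap and what add_indicator appends to the output.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry at row 3 (illustrative values only).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

# Compare what each statistical strategy writes into the missing slot.
filled_by_strategy = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled_by_strategy[strategy] = float(imputer.fit_transform(X)[3, 0])

# strategy="constant" uses fill_value instead of a computed statistic.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
filled_by_strategy["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

print(filled_by_strategy)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0, 'constant': -1.0}

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit, marking which rows were imputed.
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
with_indicator = indicator_imputer.fit_transform(X)
print(with_indicator.shape)  # (5, 2)
```

The outlier 10.0 pulls the mean well above the median here, which is one reason the median is often preferred for skewed numeric columns.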
We will see a step-by-step procedure to impute the missing values. The steps are elaborated in the sections below:
Now we will consider a car sales dataset that contains missing values. We will first import the CSV file and view the number of missing values in each column of the dataset.
import pandas as pd

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
print("First 5 rows of the dataset\n", car_sales_missing_data.head())
Line 1: We import the pandas library to read the CSV file.
Line 3: We load the CSV file using the read_csv function.
Line 4: We print the number of missing values in each column using .isna().sum().
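Since car-data.csv is not reproduced here, the sketch below builds a small hypothetical stand-in DataFrame (the rows are made up; only the column names follow the ones used later in this Answer) to show how .isna().sum() counts the gaps per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for car-data.csv; the rows are made up.
sample = pd.DataFrame({
    "Company": ["Toyota", np.nan, "BMW", "Honda"],
    "Body Colour": ["White", "Red", np.nan, np.nan],
    "Odometer (miles)": [35000.0, np.nan, 60000.0, 12000.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, 8000.0, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summing the per-column counts gives the total number of missing cells.
total_missing = int(missing_per_column.sum())
print("Total missing:", total_missing)  # Total missing: 7
```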
SimpleImputer
Now that we know the dataset contains missing values, we can define the imputation rules for each column depending on its data type. The code can be seen below:
import pandas as pd
from sklearn.impute import SimpleImputer

car_sales_missing_data = pd.read_csv("car-data.csv")

print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
print("First 5 rows of the dataset\n", car_sales_missing_data.head())

numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
Line 2: We import SimpleImputer from the sklearn.impute module.
Line 9: We define a SimpleImputer with strategy="mean" for the numerical columns.
Line 10: We define a SimpleImputer with strategy="constant" and fill_value="missing" for the categorical columns.
Line 11: We define a SimpleImputer with strategy="constant" and fill_value=4 for the door column.
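As a quick check of what these three imputers do on their own, the sketch below applies each one to a single made-up column. The values are illustrative assumptions rather than rows from car-data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up single-column frames standing in for columns of the dataset.
prices = pd.DataFrame({"Price": [10000.0, np.nan, 20000.0]})
colours = pd.DataFrame({"Body Colour": ["Red", np.nan, "Blue"]}, dtype=object)
doors = pd.DataFrame({"Doors": [4.0, np.nan, 5.0]})

numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# fit_transform learns the fill value and applies it in one step;
# each call returns a NumPy array, not a DataFrame.
filled_prices = numerical_imputer.fit_transform(prices)
filled_colours = categorical_imputer.fit_transform(colours)
filled_doors = door_imputer.fit_transform(doors)

print(filled_prices.ravel())   # gap replaced by the mean of 10000 and 20000
print(filled_colours.ravel())  # gap replaced by the string "missing"
print(filled_doors.ravel())    # gap replaced by 4
```

Keeping a separate imputer for the door column lets it default to 4 doors instead of a fractional mean.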
ColumnTransformer
Now we will group the columns by the type of imputation we want to apply. We will use ColumnTransformer to apply each imputer to its columns. The code is given below:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer", categorical_imputer, categorical_columns),
    ("numerical_imputer", numerical_imputer, numerical_columns),
    ("door_imputer", door_imputer, door_column),
])
filled = transformer.fit_transform(car_sales_missing_data)
car_sales_filled = pd.DataFrame(filled, columns=categorical_columns + numerical_columns + door_column)
print("After imputing:\n", car_sales_filled.isna().sum())
Line 3: We import ColumnTransformer from sklearn.compose.
Lines 11–13: We group the columns into 3 different lists.
Lines 15–19: We define the ColumnTransformer and pass in a list of tuples, each containing a name for the step, the imputer, and the list of columns on which we want to apply that imputer.
Line 20: We apply the ColumnTransformer to the dataset using the fit_transform function.
Line 21: We convert the transformed data into a pandas DataFrame.
Line 22: We print the number of missing values in each column of the dataset, which is now zero.
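One detail worth checking is column order: ColumnTransformer returns its output columns in the order the transformers are listed, not in the order of the original file, so the labels passed to pd.DataFrame need to follow that same order. The self-contained sketch below illustrates this on a hypothetical stand-in for car-data.csv (the rows are made up).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for car-data.csv; the rows are made up.
cars = pd.DataFrame({
    "Company": pd.Series(["Toyota", np.nan, "BMW"], dtype=object),
    "Body Colour": pd.Series(["White", "Red", np.nan], dtype=object),
    "Odometer (miles)": [30000.0, np.nan, 60000.0],
    "Doors": [4.0, np.nan, 5.0],
    "Price": [10000.0, 20000.0, np.nan],
})

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer",
     SimpleImputer(strategy="constant", fill_value="missing"),
     categorical_columns),
    ("numerical_imputer", SimpleImputer(strategy="mean"), numerical_columns),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_column),
])

filled = transformer.fit_transform(cars)

# Build the labels in the same order as the transformer list above.
output_columns = categorical_columns + numerical_columns + door_column
cars_filled = pd.DataFrame(filled, columns=output_columns)

print(cars_filled)
remaining_missing = int(cars_filled.isna().sum().sum())
print("Missing after imputing:", remaining_missing)  # Missing after imputing: 0
```

Because text and numeric columns are stacked into one array, the resulting DataFrame holds generic object columns, so numeric columns may need an explicit conversion such as astype(float) before further numeric work.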
Imputation improves data completeness and can improve model accuracy by avoiding the data loss that comes from dropping incomplete rows. We can use the SimpleImputer class provided by scikit-learn to impute the missing data of a dataset according to a chosen strategy.