How to use Sklearn to impute missing values
Imputation fills in missing data in a dataset with suitable values. The values can be mean, median, mode, or any constant. Missing data can cause issues in machine learning models, leading to biased or inaccurate results. Therefore, it is an essential task in data pre-processing to impute the data with suitable values.
In this Answer, we will look into how we can impute missing values so that the dataset remains consistent, improving the robustness of the machine learning models and processes.
The SimpleImputer class
We will use the SimpleImputer class to achieve this task. In the illustration below, we can see the result of applying imputation on the columns with missing values.
Syntax
To use SimpleImputer, we first need to import it using the following statement:
from sklearn.impute import SimpleImputer
Once the import has been done, we can create a SimpleImputer object using the following syntax (the values shown are the defaults):
SimpleImputer(missing_values=np.nan, strategy="mean", fill_value=None, copy=True, add_indicator=False, keep_empty_features=False)
strategy: By default, its value is mean. This defines how the missing values are imputed. It accepts the following options:
mean: Replaces the missing values with the mean of the column. It works only with numerical data.
median: Replaces the missing values with the median of the column. It works only with numerical data.
most_frequent: Replaces the missing values with the most frequently occurring value of the column. It works with both numerical and string data.
constant: Replaces the missing values with the constant value defined in fill_value.
fill_value: It is used when strategy is constant. By default, its value is None. If it is None, then the missing data will be filled with 0 for numerical data and with "missing_value" for string or object data.
missing_values: By default, its value is np.nan. It can also be set to pd.NA. All occurrences of missing_values are imputed.
copy: By default, its value is True. If it is True, then a copy of the dataset will be created. Otherwise, the missing values will be filled in place whenever possible.
add_indicator: By default, its value is False. When set to True, a missing value indicator column is appended to the output for each feature that had missing values during fitting, allowing predictive models to account for missingness despite imputation. A feature with no missing values at fit time gets no indicator column, even if it has missing values at transform time.
keep_empty_features: By default, its value is False. When True, columns that consist entirely of missing values are kept, and they are imputed with 0, except when strategy="constant", in which case the value defined in fill_value is used instead.
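To make the parameters above concrete, here is a minimal sketch that runs each strategy on a small array. The array X and the fill value -1 are made up for illustration and are not part of the car sales dataset used later.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing entry.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [6.0, 40.0],
])

# Fit one imputer per statistical strategy and keep the filled arrays.
results = {}
for strategy in ("mean", "median", "most_frequent"):
    results[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# strategy="constant" writes fill_value into every missing entry.
results["constant"] = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# add_indicator=True appends one 0/1 column per feature that had
# missing values during fit, marking which entries were imputed.
with_indicator = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

for name, filled in results.items():
    print(name, filled.tolist())
print("Shape with indicator columns:", with_indicator.shape)
```

Both columns contain a missing value here, so the indicator version has two extra columns on top of the two original ones.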
Coding example
We will see a step-by-step procedure to impute the missing values. The steps are elaborated in the sections below:
Import dataset
Now we will consider a dataset of car sales that contains missing values. We will first import the CSV file and view the number of missing values in each column of the dataset.
import pandas as pd
# Load the car sales dataset (assumes car-data.csv is in the working directory)
car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
print("First 5 rows of the dataset\n", car_sales_missing_data.head())
Code explanation
Line 1: We import the pandas library to read the CSV file.
Line 3: We load the CSV file using the read_csv function.
Line 4: We print the number of missing values in each column using .isna().sum().
Line 5: We print the first five rows of the dataset using the head function.
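If car-data.csv is not at hand, the same inspection can be tried on a small in-memory data frame. The rows below are invented for illustration. Only the column names are taken from this Answer.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for car-data.csv.
car_sales_sample = pd.DataFrame({
    "Company": ["Toyota", "Honda", np.nan, "BMW"],
    "Body Colour": ["White", np.nan, "Blue", "Black"],
    "Odometer (miles)": [35431.0, np.nan, 84714.0, 11179.0],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15323.0, 19943.0, np.nan, 22000.0],
})

# Count of missing entries per column.
missing_counts = car_sales_sample.isna().sum()

# Fraction of rows that are missing per column.
missing_share = car_sales_sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share as well as the count helps judge whether a column is worth imputing at all.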
Define SimpleImputer
Now that we know the dataset contains missing values, we can define the imputation rules for each column depending on its data type. The code can be seen below:
import pandas as pd
from sklearn.impute import SimpleImputer
# Load the car sales dataset
car_sales_missing_data = pd.read_csv("car-data.csv")
# Inspect the missing values and the first rows
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
print("First 5 rows of the dataset\n", car_sales_missing_data.head())
# Define one imputer per kind of column
numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
Code explanation
Line 2: We import SimpleImputer from the sklearn.impute module.
Line 9: We define a SimpleImputer with strategy="mean" for the numerical columns.
Line 10: We define a SimpleImputer with strategy="constant" and fill_value="missing" for the categorical columns.
Line 11: We define a SimpleImputer with strategy="constant" and fill_value=4 for the door column.
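Defining an imputer does not change any data. The values are computed and filled only when it is fitted and applied. As a rough sketch, assuming a tiny made-up sample with two of the column names from this Answer, a single imputer can be applied directly to one column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample; only the column names come from this Answer.
sample = pd.DataFrame({
    "Odometer (miles)": [10000.0, np.nan, 30000.0],
    "Doors": [4.0, np.nan, 5.0],
})

numerical_imputer = SimpleImputer(strategy="mean")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

# fit_transform expects 2D input, so each column is selected with a list.
odometer_filled = numerical_imputer.fit_transform(sample[["Odometer (miles)"]])
doors_filled = door_imputer.fit_transform(sample[["Doors"]])

# statistics_ holds the fill value learned for each column during fit.
print("Learned mean:", numerical_imputer.statistics_)
print("Odometer:", odometer_filled.ravel())
print("Doors:", doors_filled.ravel())
```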
Apply ColumnTransformer
Now we will group the columns depending on the type of imputation we want to apply. We will use ColumnTransformer to apply each imputer to its group of columns. The code is given below:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# Load the dataset and count the missing values
car_sales_missing_data = pd.read_csv("car-data.csv")
print("Number of missing values in each column:\n", car_sales_missing_data.isna().sum())
numerical_imputer = SimpleImputer(strategy="mean")
categorical_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
# Group the columns by the imputer that should handle them
categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]
# Map each imputer to its columns
transformer = ColumnTransformer([
    ("categorical_imputer", categorical_imputer, categorical_columns),
    ("numerical_imputer", numerical_imputer, numerical_columns),
    ("door_imputer", door_imputer, door_column),
])
filled = transformer.fit_transform(car_sales_missing_data)
car_sales_filled = pd.DataFrame(filled, columns=categorical_columns + numerical_columns + door_column)
print("After imputing:\n", car_sales_filled.isna().sum())
Code explanation
Line 3: We import ColumnTransformer from sklearn.compose.
Lines 11–13: We group the columns into 3 different lists.
Lines 15–19: We define the ColumnTransformer and pass in a list of tuples, each containing the name of the transformer, the imputer, and the list of columns on which we want to apply the imputation.
Line 20: We apply the ColumnTransformer to the dataset using the fit_transform function.
Line 21: We convert the transformed data into a pandas data frame. The output columns follow the order of the transformers in the list (categorical, numerical, then door), so we pass the column names in that same order.
Line 22: We print the number of missing values in each column of the dataset, which will be zero.
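The whole procedure can also be tried end to end without the CSV file. The sketch below uses invented rows in place of car-data.csv and adds one extra step that is not part of the walkthrough above: because strings and numbers are stacked together, the output array has the object data type, so the numerical columns are cast back to float.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical rows standing in for car-data.csv.
cars = pd.DataFrame({
    "Company": ["Toyota", np.nan, "BMW", "Honda"],
    "Body Colour": ["White", "Blue", np.nan, "Red"],
    "Odometer (miles)": [10000.0, np.nan, 30000.0, 20000.0],
    "Doors": [4.0, np.nan, 5.0, 4.0],
    "Price": [15000.0, 20000.0, np.nan, 10000.0],
})

categorical_columns = ["Body Colour", "Company"]
numerical_columns = ["Odometer (miles)", "Price"]
door_column = ["Doors"]

transformer = ColumnTransformer([
    ("categorical_imputer",
     SimpleImputer(strategy="constant", fill_value="missing"),
     categorical_columns),
    ("numerical_imputer", SimpleImputer(strategy="mean"), numerical_columns),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_column),
])

filled = transformer.fit_transform(cars)

# Output columns follow the order of the transformer list, not the order in `cars`.
output_columns = categorical_columns + numerical_columns + door_column
cars_filled = pd.DataFrame(filled, columns=output_columns)

# Restore numeric dtypes that were lost when strings and numbers were stacked.
numeric = numerical_columns + door_column
cars_filled = cars_filled.astype({column: float for column in numeric})

print(cars_filled)
print("Remaining missing values:", cars_filled.isna().sum().sum())
```

Building the column list from the same three lists that were passed to ColumnTransformer keeps the labels aligned with the data even if the groups are later reordered.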
Conclusion
Imputation helps enhance data completeness and can improve model accuracy by minimizing data loss. We can use the SimpleImputer class provided by Sklearn to impute the missing data of a dataset according to a strategy, and ColumnTransformer to apply a different strategy to each group of columns.