Mean encoding in Python
Mean encoding is a technique for transforming categorical variables into numerical values based on the mean of the target variable. It’s particularly useful in classification problems, where the goal is to predict a target variable based on input features.
How does it work?
Consider a dataset like the one below with a categorical feature “City” and a binary target variable “Churn” indicating whether a customer churned.
CustomerID | City | Churn |
1 | New York | 0 |
2 | Paris | 1 |
3 | New York | 1 |
4 | Tokyo | 1 |
5 | Paris | 1 |
Step 1: Group by category (City)
Group by
Cityto get mean values forChurn.
Step 2: Calculate mean
Calculate the mean of
Churnfor each city.Mean(
Churn) for New York = (0 + 1) / 2 = 0.5Mean(
Churn) for Paris = (1 + 1) / 2 = 1Mean(
Churn) for Tokyo = 1
Step 3: Replace categories
Replace the
Cityvalues with their meanChurnvalues.
CustomerID | City (Mean Encoded) | Churn |
1 | 0.5 | 0 |
2 | 1 | 1 |
3 | 0.5 | 1 |
4 | 1 | 1 |
5 | 1 | 1 |
In this way, the City variable is encoded with the mean Churn values, providing a numerical representation of the categorical feature for machine learning models.
How do we implement it?
Now, we will look at implementing mean encoding in Python.
Import the libraries
The first step is to import the required libraries.
import pandas as pd
Create a simple DataFrame
In this step, we will create a simple DataFrame. We can also import our dataset.
data = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],'Churn': [0, 1, 1, 1, 1]}df = pd.DataFrame(data)
Define mean encoding function
We will define a function (mean_encode) to perform mean encoding on a given data frame for a specified categorical and target feature.
def mean_encode(df, cat_feature, target_feature):mean_encoding = df.groupby(cat_feature)[target_feature].mean()df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)
Apply mean encoding
Now we will apply the mean_encode function to the data frame, specifying the categorical and target features.
mean_encode(df, 'City', 'Churn')
Example
The following code shows how we can implement mean encoding in Python:
import pandas as pd# Given DataFramedata = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],'Churn': [0, 1, 1, 1, 1]}df = pd.DataFrame(data)def mean_encode(df, cat_feature, target_feature):mean_encoding = df.groupby(cat_feature)[target_feature].mean()df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)mean_encode(df, 'City', 'Churn')print(df.head())
Explanation
Line 1: Import the
pandaslibrary for data manipulation.Lines 4–6: Create a sample data frame (
df). The data frame has two columns:CityandChurn.Lines 8–10: This code computes the mean of the
target_featuregrouped by unique values in thecat_featurecolumn and creates a new column by mapping these mean values to corresponding categories in the data framedf.Line 12: Here we call the
mean_encodefunction with the data framedf, specifyingCityas the categorical feature andChurnas the target variable.
Conclusion
Mean encoding in Python is a technique used to represent categorical variables numerically by replacing their values with the mean of the target variable. This helps capture relationships between categories and the target variable, making it useful for machine learning models.
Free Resources