Mean encoding in Python

Mean encoding is a technique for transforming categorical variables into numerical values based on the mean of the target variable. It’s particularly useful in classification problems, where the goal is to predict a target variable based on input features.

How does it work?

Consider a dataset like the one below with a categorical feature “City” and a binary target variable “Churn” indicating whether a customer churned.

CustomerID

City

Churn

1

New York

0

2

Paris

1

3

New York

1

4

Tokyo

1

5

Paris

1

Step 1: Group by category (City)

  • Group by City to get mean values for Churn.

Step 2: Calculate mean

  • Calculate the mean of Churn for each city.

    • Mean(Churn) for New York = (0 + 1) / 2 = 0.5

    • Mean(Churn) for Paris = (1 + 1) / 2 = 1

    • Mean(Churn) for Tokyo = 1

Step 3: Replace categories

  • Replace the City values with their mean Churn values.

CustomerID

City (Mean Encoded)

Churn

1

0.5

0

2

1

1

3

0.5

1

4

1

1

5

1

1

In this way, the City variable is encoded with the mean Churn values, providing a numerical representation of the categorical feature for machine learning models.

How do we implement it?

Now, we will look at implementing mean encoding in Python.

Import the libraries

The first step is to import the required libraries.

import pandas as pd

Create a simple DataFrame

In this step, we will create a simple DataFrame. We can also import our dataset.

data = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],
'Churn': [0, 1, 1, 1, 1]}
df = pd.DataFrame(data)

Define mean encoding function

We will define a function (mean_encode) to perform mean encoding on a given data frame for a specified categorical and target feature.

def mean_encode(df, cat_feature, target_feature):
mean_encoding = df.groupby(cat_feature)[target_feature].mean()
df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)

Apply mean encoding

Now we will apply the mean_encode function to the data frame, specifying the categorical and target features.

mean_encode(df, 'City', 'Churn')

Example

The following code shows how we can implement mean encoding in Python:

import pandas as pd
# Given DataFrame
data = {'City': ['New York', 'Paris', 'New York', 'Tokyo', 'Paris'],
'Churn': [0, 1, 1, 1, 1]}
df = pd.DataFrame(data)
def mean_encode(df, cat_feature, target_feature):
mean_encoding = df.groupby(cat_feature)[target_feature].mean()
df[cat_feature + '_mean_encoded'] = df[cat_feature].map(mean_encoding)
mean_encode(df, 'City', 'Churn')
print(df.head())

Explanation

  • Line 1: Import the pandas library for data manipulation.

  • Lines 4–6: Create a sample data frame (df). The data frame has two columns: City and Churn.

  • Lines 8–10: This code computes the mean of the target_feature grouped by unique values in the cat_feature column and creates a new column by mapping these mean values to corresponding categories in the data frame df.

  • Line 12: Here we call the mean_encode function with the data frame df, specifying City as the categorical feature and Churn as the target variable.

Conclusion

Mean encoding in Python is a technique used to represent categorical variables numerically by replacing their values with the mean of the target variable. This helps capture relationships between categories and the target variable, making it useful for machine learning models.

Free Resources

Copyright ©2025 Educative, Inc. All rights reserved