Customer churn rate is predicted by analyzing past customer behavior, transaction history, and engagement patterns. Techniques like logistic regression, decision trees, or machine learning models assess factors such as usage frequency, complaints, and demographic data to determine the likelihood of a customer leaving.
How to perform feature engineering for customer churn prediction
Key takeaways:
Customer churn prediction aims to identify customers likely to discontinue using a product or service by analyzing historical data to find patterns and behaviors that precede churn.
Feature engineering is crucial. Transforming raw data into meaningful features improves model performance and helps identify key attributes influencing customer churn.
Accurate churn prediction allows businesses to implement targeted retention strategies, such as personalized offers or improved customer service.
Initially, we might assume that predicting customer churn is as simple as applying machine learning algorithms to existing data. However, the real power lies in feature engineering—the process of transforming raw data into meaningful features that help improve model performance. Feature engineering plays a pivotal role in identifying key attributes that influence customer churn, and it’s crucial to not only define which type of feature engineering is being applied but also clarify the expected output.
Customer churn prediction is the task of identifying customers who are likely to stop using a product or service in the future. First, we look at past customer data to identify patterns and behaviors that often occur before a customer leaves (churns). Next, we pinpoint the specific actions or changes that signal a customer might be at risk of leaving. Finally, we use machine learning algorithms to predict which customers are likely to churn based on these identified patterns and behaviors.
The goal of customer churn prediction is to take proactive measures to retain customers before they churn, such as targeted marketing campaigns, personalized offers, or improved customer service. By accurately predicting customer churn, businesses can reduce customer attrition, increase customer satisfaction, and ultimately improve their bottom line.
Guide to performing feature engineering
Performing data preparation and feature engineering for customer churn prediction involves several steps. Here’s the step-by-step process for customer churn prediction:
Import libraries
We import essential libraries for data analysis and visualization in Python, including Pandas, NumPy, Seaborn, and Matplotlib, with inline plotting enabled.
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Load dataset
We load a CSV file named “churn.csv” from the given path. The dataset contains 10,000 rows with the attributes RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and Exited. We display the first few rows of the DataFrame using the head() function.
df = pd.read_csv('churn.csv')
df.head()
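If churn.csv isn’t available locally, a small synthetic stand-in with the same schema can be generated for experimentation. This is only a sketch; every value below is made up and does not reflect the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in mirroring the churn.csv schema (all values are made up)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    'CreditScore': rng.integers(350, 851, n),
    'Geography': rng.choice(['France', 'Germany', 'Spain'], n),
    'Gender': rng.choice(['Female', 'Male'], n),
    'Age': rng.integers(18, 92, n),
    'Tenure': rng.integers(0, 11, n),
    'Balance': rng.uniform(0, 250000, n).round(2),
    'NumOfProducts': rng.integers(1, 5, n),
    'HasCrCard': rng.integers(0, 2, n),
    'IsActiveMember': rng.integers(0, 2, n),
    'Exited': rng.integers(0, 2, n),
})
print(df.shape)
```

With such a stand-in, every subsequent step in this guide runs unchanged, although the plots and statistics will of course differ from the real data.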
Here’s the output:
Data preprocessing
In this step, we inspect summary statistics of the churn dataset using the describe() function.
df.describe()
Here’s the output:
After that, we count the missing values in the churn dataset. Counting missing values is important because it helps us understand the quality of our data and identify any gaps that might need to be addressed before analysis.
df.isna().sum()
Here’s the output:
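If gaps do turn up, two common remedies are dropping incomplete rows or imputing a statistic such as the column median. A minimal sketch on a toy frame (the values here are made up; they are not from the churn data):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps (values are made up)
toy = pd.DataFrame({'Balance': [1000.0, np.nan, 2500.0],
                    'Tenure': [3.0, 5.0, np.nan]})

missing = toy.isna().sum()           # count gaps per column
dropped = toy.dropna()               # option 1: drop incomplete rows
imputed = toy.fillna(toy.median())   # option 2: fill with column medians
```

Which option is appropriate depends on how much data would be lost by dropping rows and whether the imputed statistic is plausible for the feature.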
Now, we’ll display the column names of the dataset.
df.columns
Here’s the output:
After counting missing values in the churn dataset, we show the box plots of the numerical variables (CreditScore, Age, Tenure, and Balance) grouped by churn status.
# Box plots of numerical variables by churn status
column_names = ['CreditScore', 'Age', 'Tenure', 'Balance']
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
# Populate each subplot with a boxplot
for i, subplot in zip(column_names, ax.flatten()):
    sns.boxplot(x='Exited', y=i, data=df, ax=subplot)
# Display the plot
plt.show()
The output of the above code represents the distribution of four numerical variables (CreditScore, Age, Tenure, and Balance), segmented by the Exited variable, which indicates whether a customer has churned (Exited = 1) or is still active (Exited = 0).
CreditScore vs. Exited: The box plot for CreditScore shows that customers who have not exited (Exited = 0, i.e., they are still active and using the service) tend to have higher credit scores (the median is higher), while those who have exited (Exited = 1, i.e., they have stopped using the service or product) seem to have lower credit scores. There are some outliers (points that fall outside the whiskers), especially for the exited group, which could indicate customers with very low or high credit scores.
Age vs. Exited: The Age box plot reveals that the median age for customers who did not exit is lower than those who exited. The Exited = 1 group shows a wider range of ages, including some older customers (indicated by higher outliers), suggesting that older customers are more likely to churn.
Tenure vs. Exited: The Tenure box plot indicates that customers who did not exit tend to have a higher average tenure (the median is closer to 6), whereas the Exited group has a slightly lower median tenure. The range for the Exited group is also wider, with a few extreme values.
Balance vs. Exited: For Balance, the customers who did not exit have a higher median balance compared to those who exited, suggesting that higher account balances might be associated with lower churn. The plot shows the presence of outliers in both groups, though more extreme values are found in the Exited group.
Outlier removal
Outliers are data points that significantly differ from the rest and can negatively impact the performance of machine learning models. Removing outliers ensures the dataset is clean and produces accurate results. To detect outliers, we use statistical methods like percentiles, which divide data into 100 equal parts to show its spread. For example, the 25th percentile (lower quartile) marks where 25% of the data lies below, and the 75th percentile (upper quartile) marks where 75% lies below. Using np.percentile, we calculate these values and determine the interquartile range (iqr = quartile75 - quartile25), representing the “normal range” of data.
Outliers are identified as values that fall outside the thresholds:
Minimum = 25th percentile - 1.5 × IQR
Maximum = 75th percentile + 1.5 × IQR
Rows with values below the minimum or above the maximum are filtered out. After removing outliers, box plots of numerical variables by churn status are updated to reflect the cleaned dataset.
# Removing outliers
for col in column_names:
    quartile75, quartile25 = np.percentile(df[col], [75, 25])
    iqr = quartile75 - quartile25
    lower = quartile25 - (iqr * 1.5)
    upper = quartile75 + (iqr * 1.5)
    df = df[(df[col] < upper)]
    df = df[(df[col] > lower)]

# Box plots after outlier removal
fig, ax = plt.subplots(2, 2, figsize=(15, 15))
for col, subplot in zip(column_names, ax.flatten()):
    sns.boxplot(x='Exited', y=col, data=df, ax=subplot)
plt.show()
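As a quick numeric sanity check of these fences, here is the same IQR computation on a tiny toy sample (the values are made up, not from the churn data):

```python
import numpy as np

# Toy sample where 95 is the obvious outlier (values are made up)
data = np.array([10, 12, 13, 14, 15, 16, 18, 95])
q75, q25 = np.percentile(data, [75, 25])
iqr = q75 - q25                 # 16.5 - 12.75 = 3.75
lower = q25 - 1.5 * iqr         # 12.75 - 5.625 = 7.125
upper = q75 + 1.5 * iqr         # 16.5 + 5.625 = 22.125
kept = data[(data > lower) & (data < upper)]  # 95 falls outside the fences
```

Only the value 95 lies outside [7.125, 22.125], so it is the single point removed.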
CreditScore: The spread of credit scores is similar for both customers who exited (Exited = 1) and those who stayed (Exited = 0). No noticeable outliers are present after the removal step.
Age: The age distribution shows that customers who exited tend to have a higher median age compared to those who stayed. The outliers seen previously in the Exited = 0 group are now removed, resulting in a cleaner dataset.
Tenure: The tenure variable displays a similar range for both groups, with a relatively uniform distribution. Outliers have been removed, ensuring a more compact representation.
Balance: Both groups show a similar distribution of balances, with the majority of values concentrated below a specific range. No extreme values remain after outlier removal.
Encoding categorical variables
We encode the categorical variables Geography and Gender into numerical labels using LabelEncoder from scikit-learn, since they are the only non-numerical features in the dataset and machine learning models require numeric input. We then display the first few rows of the DataFrame with the transformed columns and print the classes that were encoded.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Geography'] = le.fit_transform(df['Geography'])
df['Gender'] = le.fit_transform(df['Gender'])
le.classes_  # classes from the most recent fit (here, Gender)
df.head()
Here, Gender 0 represents Female and 1 represents Male, while Geography is numerically encoded (e.g., 0 for France, 1 for Germany, 2 for Spain). This transformation makes categorical data usable for machine learning models.
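One caveat: LabelEncoder imposes an arbitrary order on Geography (0 < 1 < 2), which linear models can misread as a ranking. A common alternative is one-hot encoding, sketched here with pandas' get_dummies on toy values:

```python
import pandas as pd

# One-hot alternative to label encoding (toy values)
toy = pd.DataFrame({'Geography': ['France', 'Germany', 'Spain', 'France']})
onehot = pd.get_dummies(toy, columns=['Geography'], prefix='Geo')
```

Each category becomes its own 0/1 column (Geo_France, Geo_Germany, Geo_Spain), so no spurious ordering is introduced, at the cost of extra columns.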
Heatmap of the dataset
We create a figure with a heatmap visualization of the correlation matrix of the DataFrame df. The heatmap displays the correlation matrix, which helps identify which features are strongly correlated with each other or with the target variable. This insight is crucial for feature selection and understanding the data’s structure, guiding the model-building process.
plt.figure(figsize=(8, 8))
sns.heatmap(df.corr(), cmap='Blues', annot=True)
plt.show()
The heatmap of the churn dataset is shown below:
The correlation heatmap reveals key insights into the relationships between features and the target variable, Exited. Age shows a moderate positive correlation with Exited, suggesting older customers are more likely to leave, while Gender and Geography have weak correlations with the target. The balance and number of products are moderately negatively correlated, indicating customers with fewer products tend to have higher balances. Additionally, HasCrCard has a slight negative correlation with Exited, implying credit card holders may be less likely to exit. These insights are valuable for feature selection and model building, highlighting which variables to prioritize based on their strength of correlation with the target.
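To rank features by their relationship with the target without reading the full heatmap, the Exited column of the correlation matrix can be sorted directly. A sketch on a toy numeric frame standing in for the encoded churn data (values are made up):

```python
import pandas as pd

# Toy numeric frame standing in for the encoded churn data (values are made up)
toy = pd.DataFrame({
    'Age':     [25, 40, 55, 60, 30],
    'Balance': [0.0, 120000.0, 80000.0, 150000.0, 10000.0],
    'Exited':  [0, 0, 1, 1, 0],
})
corr_with_target = toy.corr()['Exited'].drop('Exited').sort_values(ascending=False)
```

On the real dataset the same one-liner (df.corr()['Exited'].drop('Exited').sort_values()) gives an at-a-glance feature ranking for selection.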
Countplot of categorical features
We create subplots to visualize the count of categorical features (Gender, HasCrCard, IsActiveMember) segmented by the Exited variable. By creating these plots, we can observe patterns or trends in the categorical features, such as how many male vs. female customers exited or how the status of having a credit card affects customer churn. This step helps identify any significant relationships between categorical features and the target variable, aiding in feature selection and understanding the factors influencing customer behavior.
fig, ax = plt.subplots(1, 3, figsize=(10, 6))
categorical_features = ['Gender', 'HasCrCard', 'IsActiveMember']
for col, subplot in zip(categorical_features, ax.flatten()):
    sns.countplot(x=col, hue='Exited', data=df, ax=subplot)
plt.show()
In the output bar plots:
The blue bars represent customers who did not exit (Exited = 0), and the orange bars represent customers who exited (Exited = 1).
Gender: In the first plot, we can observe that most of the customers are male (represented by 1 in the “Gender” column), and relatively fewer women (represented by 0) left the service. The proportion of women who left is much smaller compared to men.
HasCrCard: In the second plot, most customers who have a credit card (1) are still with the service, but a significant number of customers without a credit card (0) left.
IsActiveMember: The third plot shows that active members (1) are less likely to leave, as seen from the higher blue bars for IsActiveMember = 1.
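The counts in these plots can also be expressed as a churn rate per category, which a simple groupby makes explicit. A sketch on toy values (not the real counts from the dataset):

```python
import pandas as pd

# Churn rate per category via groupby (toy values)
toy = pd.DataFrame({
    'IsActiveMember': [1, 1, 1, 0, 0, 0],
    'Exited':         [0, 0, 1, 1, 1, 0],
})
# Because Exited is 0/1, the group mean is exactly the churn rate
churn_rate = toy.groupby('IsActiveMember')['Exited'].mean()
```

Applied to the real DataFrame, df.groupby('IsActiveMember')['Exited'].mean() turns the visual impression from the count plot into a number per group.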