Why do we need transformations?
This lesson covers the meaning and purpose of transforming variables in our datasets, how to discover the distribution of our features, and the relation between a variable's distribution and a mathematical transformation. We will also talk about some of the most frequently used methods for transforming variables during feature engineering.
Features in real datasets often follow a skewed distribution. Since several machine learning models assume or benefit from normally distributed variables, it is important to find ways to bring our variables closer to normality. This is why, in this section, we will discuss the different mathematical transformations we can use on our variables to achieve that.
Why these transformations?
A few machine learning models, such as linear and logistic regression, tend to give better results when the variables are approximately normally distributed. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed features, but reducing skewness in the variables often helps in practice. In real datasets, however, variables most likely follow a skewed distribution.
To improve the performance of our models, we apply transformations to our variables that bring their skewed distribution closer to a normal one.
First, we have to determine whether each variable's distribution is normal or skewed. We can do that using histograms and Q-Q plots. Here is an example of a Q-Q plot:
In a Q-Q plot, if the values of a variable fall along a 45-degree line when plotted against the theoretical quantiles, then the variable follows a normal distribution.
Run this code to build the previous plot:
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns

# Load the Titanic dataset
data = sns.load_dataset("titanic")

# Display the first few rows of the data
print(data.head())

# Draw the Q-Q plot of the fare column against a normal distribution
stats.probplot(data["fare"], dist="norm", plot=plt)

# Save the figure before showing it; after plt.show() the figure may be empty
plt.savefig('output/box.png')
plt.show()
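A visual check can be complemented with numbers. The following is a minimal sketch, not part of the original lesson, that computes the sample skewness and the correlation coefficient of the Q-Q plot for two samples. The data is synthetic (an assumption made so the example runs without downloading anything), and the helper name normality_summary is hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic samples (illustrative assumption, not the Titanic fares)
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=1000)
skewed_sample = rng.lognormal(mean=3, sigma=1, size=1000)


def normality_summary(values):
    """Return (sample skewness, Q-Q plot correlation coefficient r)."""
    skewness = stats.skew(values)
    # probplot returns the plotted quantiles and a least-squares fit
    # (slope, intercept, r); r close to 1 means the points lie near a line
    _, (_, _, r) = stats.probplot(values, dist="norm")
    return skewness, r


for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    skewness, r = normality_summary(sample)
    print(f"{name}: skewness={skewness:.2f}, Q-Q r={r:.3f}")
```

A skewness near zero together with an r close to 1 suggests an approximately normal variable, while a large positive skewness and a noticeably lower r point to a right-skewed one.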
How can we transform variables?
Among the transformation techniques, these are the most widely used ones:
- Logarithmic transformation
- Square root transformation
- Reciprocal transformation
- Exponential or power transformation
- Box-Cox transformation
- Yeo-Johnson transformation
Be aware that, like any technique, these transformations have some limitations. For example, the logarithmic transformation is defined only for positive numbers, and the reciprocal transformation is not defined for zero.
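As a rough illustration of the techniques listed above, the sketch below applies each one to a synthetic, strictly positive, right-skewed sample and prints the resulting skewness. The data, the exponent 0.3, and the variable names are assumptions chosen for the example; Box-Cox and Yeo-Johnson use the scipy.stats functions boxcox and yeojohnson, which estimate their parameter from the data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data (illustrative assumption)
rng = np.random.default_rng(42)
x = rng.lognormal(mean=2, sigma=0.8, size=2000)

transformed = {
    "original": x,
    "log": np.log(x),                  # defined only for x > 0
    "square root": np.sqrt(x),         # defined only for x >= 0
    "reciprocal": 1.0 / x,             # not defined for x == 0
    "power (0.3)": np.power(x, 0.3),   # exponent chosen arbitrarily here
}

# Both functions return the transformed data and the fitted lambda
transformed["Box-Cox"], boxcox_lambda = stats.boxcox(x)       # needs x > 0
transformed["Yeo-Johnson"], yj_lambda = stats.yeojohnson(x)   # any real x

for name, values in transformed.items():
    print(f"{name:>12}: skewness = {stats.skew(values):.2f}")
```

Not every transformation helps for every variable, so it is worth comparing the skewness (or the Q-Q plot) before and after and keeping the one that brings the distribution closest to normal.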