Scaling

Learn how to scale data with scikit-learn.

Scaling means transforming numerical variables so that they have a similar scale, that is, a similar range or magnitude of values. It’s an important step in the ML process because some algorithms are sensitive to the scale of the input variables. The scikit-learn library provides several methods for scaling numerical variables, including StandardScaler, MinMaxScaler, and RobustScaler.

To better illustrate this point, let’s take a look at how some algorithms can be impacted by the scale of the variables (this code is just for illustration purposes; there is no need to memorize it):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
# load the breast cancer dataset
cancer = load_breast_cancer()
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)
# fit an SVM model to the training data and evaluate its performance on the test data
svm = SVC(random_state=42)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy without scaling: {acc:.2f}") # output: Accuracy without scaling: 0.94
# scale the input features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# fit an SVM model to the scaled training data and evaluate its performance on the scaled test data
svm_scaled = SVC(random_state=42)
svm_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = svm_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {acc_scaled:.2f}") # output: Accuracy with scaling: 0.97

Notice that accuracy rose from 0.94 to 0.97 once the variables were scaled. This happens with algorithms that are sensitive to the scale of their input variables, such as the SVM used here. Let’s see some of the methods we can use to address this.
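To get a feel for why scale matters in this dataset, it can help to compare how widely the raw features vary. The following is a minimal sketch for inspection only: it fits the scaler on the full dataset rather than on a training split, and the variable names (raw_range, widest, narrowest) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = cancer.data
names = list(cancer.feature_names)

# spread (max - min) of each raw feature
raw_range = X.max(axis=0) - X.min(axis=0)
widest = int(np.argmax(raw_range))
narrowest = int(np.argmin(raw_range))
print(f"Widest raw range:    {names[widest]} ({raw_range[widest]:.4g})")
print(f"Narrowest raw range: {names[narrowest]} ({raw_range[narrowest]:.4g})")

# after standardization, every feature has the same spread (standard deviation of 1)
# note: fitted on all rows here only to inspect the data, not to train a model
X_scaled = StandardScaler().fit_transform(X)
scaled_std = X_scaled.std(axis=0)
print(f"Std per feature after scaling: min={scaled_std.min():.2f}, max={scaled_std.max():.2f}")
```

When the raw ranges differ by several orders of magnitude, the widest features can dominate any computation that combines features, which is one reason scaling helps scale-sensitive models.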

The StandardScaler method

The StandardScaler method transforms each numerical variable so that it has a mean of zero and a standard deviation of one. This is achieved by subtracting the mean of each variable from each value and then dividing by the standard deviation:

$$x_{\text{std}} = \frac{x - \mu}{\sigma}$$

where $x_{\text{std}}$ is the standardized value of $x$, $x$ is the original value of the variable, $\mu$ is the mean of the variable, and $\sigma$ is the standard deviation of the variable.
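As a quick check of the formula, the standardization can be computed by hand with NumPy and compared against StandardScaler. This is a small sketch on a made-up array (the values in X are illustrative, not from the lesson). It assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data: two columns with very different scales (illustrative values)
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
    [4.0, 700.0],
])

# apply the formula column by column: subtract the mean, divide by the standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - mu) / sigma

# let scikit-learn do the same thing
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

Both columns end up with a mean of zero and a standard deviation of one, even though their original magnitudes differ by a factor of about a hundred.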

The following code demonstrates how to use the ...