Scaling
Learn how to scale data with scikit-learn.
We'll cover the following...
Scaling means transforming numerical variables so that they have a similar StandardScaler
, MinMaxScaler
, and RobustScaler
.
To better illustrate this point, let’s take a look at how some algorithms can be impacted by the scale of the variables (this code is just for illustration purposes; there is no need to memorize it):
from sklearn.datasets import load_breast_cancerfrom sklearn.model_selection import train_test_splitfrom sklearn.svm import SVCfrom sklearn.metrics import accuracy_scorefrom sklearn.preprocessing import StandardScaler# load the breast cancer datasetcancer = load_breast_cancer()# split the data into training and test setsX_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)# fit an SVM model to the training data and evaluate its performance on the test datasvm = SVC(random_state=42)svm.fit(X_train, y_train)y_pred = svm.predict(X_test)acc = accuracy_score(y_test, y_pred)print(f"Accuracy without scaling: {acc:.2f}") # output: Accuracy without scaling: 0.94# scale the input features using StandardScalerscaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train)X_test_scaled = scaler.transform(X_test)# fit an SVM model to the scaled training data and evaluate its performance on the scaled test datasvm_scaled = SVC(random_state=42)svm_scaled.fit(X_train_scaled, y_train)y_pred_scaled = svm_scaled.predict(X_test_scaled)acc_scaled = accuracy_score(y_test, y_pred_scaled)print(f"Accuracy with scaling: {acc_scaled:.2f}") # output: Accuracy with scaling: 0.97
Notice how accuracy has drastically increased after scaling the variables. This happens with certain algorithms that are sensitive to the scale of variables. Let’s see some of the methods we can use to address this.
The StandardScaler
method
The StandardScaler
method transforms each numerical variable so that it has a mean of zero and a standard deviation of one. This is achieved by subtracting the mean of each variable from each value and then dividing by the standard deviation:
where
The following code demonstrates how to use the ...