Supervised Learning with Sklearn
Get hands-on experience with data science basics using sklearn.
What is sklearn?
The sklearn library, more commonly known as scikit-learn, is one of the most useful, efficient, and robust libraries for machine learning in Python. It’s built upon NumPy, SciPy, and Matplotlib, and it provides tools for almost all popular machine learning algorithms. In this lesson, however, we’ll focus on the supervised learning pipeline.
Datasets
There are a few toy datasets available in sklearn that don’t require downloading from any external source. The code for loading different datasets is consistent:
from sklearn import datasets
X, y = datasets.load_name(return_X_y=True)
Here, name in the datasets.load_name() call is the name of the dataset, and return_X_y=True returns the features and targets as NumPy arrays instead of a Bunch object. For example, there’s a regression dataset for analysis on diabetes named diabetes, which can be loaded as:
X, y = datasets.load_diabetes(return_X_y=True)
Let’s check the total number of samples in the diabetes dataset along with its feature count:
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
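Running the code above shows that the diabetes dataset contains 442 samples, each described by 10 features.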
Toy datasets
The list of the available toy datasets is as follows:
| Name | Type |
| --- | --- |
| boston | Regression |
| iris | Classification |
| diabetes | Regression |
| digits | Classification |
| linnerud | Regression (multitarget) |
| wine | Classification |
| breast_cancer | Classification |

Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of ethical concerns about the dataset, so it’s only available in older versions.
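As a quick check, here’s a minimal sketch (assuming scikit-learn 1.2 or later, where load_boston is no longer available) that loops over the remaining loaders and prints each dataset’s dimensions:

from sklearn import datasets

# Toy-dataset loaders that ship with scikit-learn (no download needed)
loaders = [datasets.load_iris, datasets.load_diabetes, datasets.load_digits,
           datasets.load_linnerud, datasets.load_wine, datasets.load_breast_cancer]
for loader in loaders:
    X, y = loader(return_X_y=True)
    print(f'{loader.__name__}: {X.shape[0]} samples, {X.shape[1]} features')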
Large datasets
There are several large datasets available from external sources. For example, a popular dataset often used for face recognition is lfw_people (Labeled Faces in the Wild), which can be downloaded using the following code:
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
Different people have different numbers of face images in the dataset; the min_faces_per_person parameter keeps only those people who have at least that many images. There are other parameters as well.
Note: This dataset has to be downloaded and, therefore, might take a while.
import numpy as np
from sklearn import datasets

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
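For illustration, here’s a hedged sketch of a few of those other parameters (resize and color exist in current scikit-learn, though their defaults may vary across versions):

from sklearn import datasets

# Load smaller grayscale images to reduce memory usage
X, y = datasets.fetch_lfw_people(
    return_X_y=True,
    min_faces_per_person=70,
    resize=0.4,   # rescale each face image to 40% of its original size
    color=False,  # grayscale images (the default)
)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')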
Synthetic datasets
For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created using calls to make_regression and make_classification, respectively.
import numpy as np
from sklearn import datasets

n_samples = 1000
n_features = 10
n_informative = n_features // 2

X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes, n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')
Here is the explanation for the code above:

- The make_regression call generates a dataset for regression tasks with 1000 samples, 10 features, and 5 informative features (half of the total features).
- The make_classification call generates a dataset for classification tasks with 1000 samples, 10 features, 3 classes, 1 cluster per class, and 5 informative features (half of the total features).
- After each generation, we print the dimensions of the generated dataset, including the number of samples, features, and classes (for the classification dataset).
Note: For multilabel classification, see make_multilabel_classification, and for multi-target regression, use the n_targets parameter in the make_regression call.
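Building on the note above, here’s a minimal sketch of both options (the parameter values are arbitrary, chosen for illustration):

from sklearn import datasets

# Multilabel classification: each sample can carry several of the 5 labels at once
X, Y = datasets.make_multilabel_classification(n_samples=1000, n_features=10, n_classes=5)
print(f'Multilabel target shape = {Y.shape}')  # (1000, 5): one indicator column per label

# Multi-target regression: n_targets produces a 2-D target matrix
X, y = datasets.make_regression(n_samples=1000, n_features=10, n_targets=3)
print(f'Multi-target regression target shape = {y.shape}')  # (1000, 3)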
Feature transformation
Feature transformation is the process of converting a dataset’s original features or variables into a new set of features using various mathematical functions. The aim of feature transformation is to improve the performance of machine learning algorithms by transforming the data to make it easier for algorithms to learn patterns and make accurate predictions.
Standardization
Data preprocessing often improves performance. Although there are several ways to preprocess the data, we’ll discuss one of the most popular: StandardScaler, which scales each feature (column) so that it has a mean of 0 and a standard deviation of 1.
The original data plot shows a scatter plot of two variables, “YearsExperience” and “Salary,” which correlate positively: as the years of experience increase, so does the salary. After applying the StandardScaler to the “YearsExperience” variable, the shape of the scatter plot remains the same, and the positive correlation between “YearsExperience” and “Salary” is still present. This transformation helps compare the relative magnitudes of different features. Note that the StandardScaler doesn’t change the relationships between the data points; it only rescales each feature to have a mean of 0 and a standard deviation of 1.
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

np.set_printoptions(precision=2, suppress=True)

X, y = datasets.make_regression(n_samples=500, n_features=2)
print('Means:     ', np.abs(np.mean(X, axis=0)))  # mean of each feature across samples
print('Variances: ', np.var(X, axis=0))           # variance of each feature across samples

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print('Means after standardization:     ', np.abs(np.mean(X_scaled, axis=0)))  # ~0 for each feature
print('Variances after standardization: ', np.var(X_scaled, axis=0))           # ~1 for each feature
The code above shows that StandardScaler has scaled each feature to a mean of 0 and a variance of 1.
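In practice, the scaler is usually combined with a model in a pipeline so that it’s fit on the training data only. Here’s a hedged sketch (the Ridge model and the train/test split are illustrative choices, not part of the lesson):

from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler and the model on the training data only,
# then applies the same scaling to the test data during scoring
model = make_pipeline(StandardScaler(), Ridge())
model.fit(X_train, y_train)
print(f'R^2 on test data: {model.score(X_test, y_test):.2f}')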
Polynomial features
The PolynomialFeatures transformer expands the original features into polynomial terms, and linear models can then be used on the transformed features for polynomial fitting. For example, if there are two features, say, ...
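As a preview of the transformation, here’s a minimal sketch for a single sample with two features (degree 2 is assumed for illustration):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # one sample with features x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))  # [[1. 2. 3. 4. 6. 9.]] -> 1, x1, x2, x1^2, x1*x2, x2^2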