
Supervised Learning with Sklearn

Get hands-on experience with data science basics using sklearn.

Scikit-learn (often stylized as sklearn) is the essential Python library for machine learning. While we've seen it in previous lessons, in this lesson, we’ll take a hands-on journey through the supervised learning workflow with scikit-learn. We’ll start by exploring data handling and feature engineering, ensuring our datasets are ready for modeling. Then, we’ll see how sklearn’s pipeline utility can simplify and streamline the entire process, letting us combine preprocessing and modeling in one seamless step. Finally, we’ll delve into advanced model tuning techniques, utilizing GridSearchCV and RandomizedSearchCV to optimize models for both regression and classification tasks, ensuring our models perform at their best without requiring manual trial and error.

Data handling in Scikit-learn

Scikit-learn provides convenient utilities for accessing datasets. The general workflow involves loading a dataset into two main components:

  • X: The feature matrix (or input data).
  • y: The target vector (or labels/output).
from sklearn import datasets
X, y = datasets.load_name(return_X_y=True)

Here, name in the datasets.load_name() call is the name of the dataset. For example, there is a regression dataset on diabetes progression named diabetes, and it can be loaded as:

X, y = datasets.load_diabetes(return_X_y=True)

The parameter return_X_y=True is a useful convenience feature. It instructs the loading function (e.g., load_diabetes, load_iris) to return the data directly as the tuple (X, y), where X is the feature data and y is the target data, instead of returning a Bunch object (a dictionary-like container) that would require accessing the data via attributes like dataset.data and dataset.target.
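
For illustration, here is what the two access patterns look like side by side with the diabetes loader; the variable names bunch, X2, and y2 are just placeholders for this sketch:

from sklearn import datasets

# Without return_X_y: a Bunch (dictionary-like) object
bunch = datasets.load_diabetes()
X, y = bunch.data, bunch.target

# With return_X_y=True: the (X, y) tuple directly
X2, y2 = datasets.load_diabetes(return_X_y=True)
print(X.shape == X2.shape, y.shape == y2.shape)  # True True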

Let’s look at the structure of the data using the diabetes dataset as an example:

Python 3.10.4
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
  • Line 5: The expression X.shape returns a tuple representing the dimensions of the NumPy array X.

    • X.shape[0] gives the number of samples (or observations) in the dataset. This represents the number of rows.

    • X.shape[1] gives the number of features (or characteristics) for each sample. This represents the number of columns.

In the output, we see the total number of data points available and the number of input variables used to describe each data point.

The list of available toy datasets is as follows:

Name            Type
boston          Regression
iris            Classification
diabetes        Regression
digits          Classification
linnerud        Regression (multitarget)
wine            Classification
breast_cancer   Classification

Note: The boston dataset was deprecated and removed in scikit-learn 1.2 due to ethical concerns about one of its features, so load_boston is not available in recent versions.
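
As a quick sanity check, any of these loaders can be called in a loop to report each dataset's size. The sketch below uses loaders that ship with current scikit-learn (boston is omitted for the reason noted above):

from sklearn import datasets

# A few of the toy dataset loaders from the table above
loaders = {
    'iris': datasets.load_iris,
    'diabetes': datasets.load_diabetes,
    'digits': datasets.load_digits,
    'wine': datasets.load_wine,
    'breast_cancer': datasets.load_breast_cancer,
}

for name, loader in loaders.items():
    X, y = loader(return_X_y=True)
    print(f'{name}: {X.shape[0]} samples, {X.shape[1]} features')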

Large datasets

There are also several larger datasets that are fetched from external sources. For example, lfw_people, a popular dataset for face recognition, can be downloaded using the following code:

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)

Different people have different numbers of face images in the dataset; the min_faces_per_person parameter keeps only those people who have at least that many images. The fetcher accepts other parameters as well.

Note: This dataset has to be downloaded and, therefore, might take a while.

Python 3.10.4
import numpy as np
from sklearn import datasets
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
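
When the image dimensions or the person names are needed, the same fetcher can be called without return_X_y to get the Bunch form. The sketch below also passes resize, an optional parameter of fetch_lfw_people that rescales each image; the value 0.4 is an arbitrary choice for illustration:

from sklearn import datasets

# Fetching the Bunch form instead of (X, y) exposes extra metadata
people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(people.images.shape)    # (n_samples, height, width) of the grayscale images
print(people.target_names)    # names of the people kept after filtering
print(people.data.shape)      # the same images flattened into the feature matrix X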

Synthetic datasets

For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created using calls to make_regression and make_classification, respectively.

Python 3.10.4
import numpy as np
from sklearn import datasets

n_samples = 1000
n_features = 10
n_informative = n_features // 2

# Generate a synthetic regression dataset
X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes, n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')

Here is the explanation for the code above:

  • Line 4: A dataset with 1000 rows.
  • Line 5: Each sample will have 10 input features.
  • Line 6: Half of the features (i.e., 5) will actually influence the target. The remaining features are noise or redundant.
  • Lines 9–10: make_regression creates a synthetic regression dataset.
    • X: a matrix of shape (1000, 10) containing feature values.
    • y: a vector of target values computed from the 5 informative features plus noise.
  • Line 11: Prints the shape of the regression dataset: number of rows (samples) and number of columns (features).
  • Line 13: The classification problem will have three classes.
  • Line 14: Each class will be formed from one cluster in feature space.
  • Lines 15–17: Generate a multiclass classification dataset. This dataset still contains 10 total features, with 5 of them being informative, and it forms 3 distinct classes, each represented by a single cluster.

Note: For more complex scenarios, sklearn offers functions for datasets where each sample has multiple target values. For multilabel classification (e.g., assigning several labels to a single image), we can use make_multilabel_classification. For multi-target regression (e.g., predicting several continuous quantities for each sample), pass the n_targets parameter to make_regression.
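
As a rough sketch of those two variants (the sample counts and label settings below are arbitrary):

from sklearn import datasets

# Multilabel classification: each sample can carry several labels at once
X_ml, Y_ml = datasets.make_multilabel_classification(n_samples=200, n_features=10,
                                                     n_classes=4, n_labels=2)
print(Y_ml.shape)  # (200, 4): one indicator column per possible label

# Multi-target regression: several continuous targets per sample
X_mt, Y_mt = datasets.make_regression(n_samples=200, n_features=10, n_targets=3)
print(Y_mt.shape)  # (200, 3): three target values per sample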

Feature cleaning

Feature cleaning is the essential process of preparing data by handling errors, inconsistencies, and especially missing values. A machine learning model performs poorly or fails outright if fed raw, messy data.

Handling missing data with imputation

Imputation fills in missing values instead of dropping rows or columns. Here, we use sklearn’s SimpleImputer to replace NaN values with the median of each feature.

Python 3.10.4
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values (NaN)
X_with_nans = np.array([[1, 2], [np.nan, 3], [7, 6], [8, np.nan]])

# Use the median strategy for imputation
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Fit and transform the data
X_imputed = imputer.fit_transform(X_with_nans)
print("Imputed Data:\n", X_imputed)  # Output: [[1. 2.] [7. 3.] [7. 6.] [8. 3.]]
  • Line 1: Import SimpleImputer from sklearn. This tool automatically replaces missing values with a chosen statistic (mean, median, most frequent, or constant).

  • Line 2: Import NumPy for creating and handling numerical arrays.

  • Line 5: Create a small feature matrix containing NaN values. Each row represents a sample, each column a feature.

  • Line 8: Create a SimpleImputer object.

    • missing_values=np.nan tells it what counts as missing.

    • strategy='median' means each NaN will be replaced with the median of its column.

  • Line 11: fit_transform first learns the median from the non-missing values (fit), then applies the replacement to fill NaNs (transform).

  • Line 12: Print the imputed dataset. All NaN values are replaced with the median of their respective columns.
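
In a real project, the imputer is typically fitted on the training split only and then reused to transform unseen data, so no information from the test set leaks into the learned statistics. A minimal sketch with made-up arrays:

from sklearn.impute import SimpleImputer
import numpy as np

# Hypothetical training and test matrices with missing entries
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 5.0], [4.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X_train)                # learn the column medians from the training data only
print(imputer.statistics_)          # [4. 3.]
print(imputer.transform(X_test))    # [[4. 5.] [4. 3.]]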

After cleaning, features often need to be mathematically transformed to improve their structure for the learning algorithm. ...