
Supervised Learning with Sklearn

Get hands-on experience with data science basics using sklearn.

Scikit-learn (often stylized as sklearn) is the essential Python library for machine learning. While we've seen it in previous lessons, in this lesson, we’ll take a hands-on journey through the supervised learning workflow with scikit-learn. We’ll start by exploring data handling and feature engineering, ensuring our datasets are ready for modeling. Then, we’ll see how sklearn’s pipeline utility can simplify and streamline the entire process, letting us combine preprocessing and modeling in one seamless step. Finally, we’ll delve into advanced model tuning techniques, utilizing GridSearchCV and RandomizedSearchCV to optimize models for both regression and classification tasks, ensuring our models perform at their best without requiring manual trial and error.

Data handling in Scikit-learn

Scikit-learn provides convenient utilities for accessing datasets. The general workflow involves loading a dataset into two main components:

  • X: The feature matrix (or input data).
  • y: The target vector (or labels/output).
from sklearn import datasets
X, y = datasets.load_name(return_X_y=True)

Here, name in the datasets.load_name() call is the name of the dataset. For example, there is a regression dataset on diabetes progression named diabetes, and it can be loaded as:

X, y = datasets.load_diabetes(return_X_y=True)

The parameter return_X_y=True is a useful convenience feature. It instructs the loading function (e.g., load_diabetes, load_iris) to return the data directly as the tuple (X, y), where X is the feature data and y is the target data, instead of returning a Bunch object (a dictionary-like container) that would require accessing the data via attributes like dataset.data and dataset.target.
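
For illustration, here is what the two access patterns look like side by side with the diabetes loader; the variable names bunch, X2, and y2 are just placeholders for this sketch:

from sklearn import datasets

# Without return_X_y: a Bunch (dictionary-like) object
bunch = datasets.load_diabetes()
X, y = bunch.data, bunch.target

# With return_X_y=True: the (X, y) tuple directly
X2, y2 = datasets.load_diabetes(return_X_y=True)
print(X.shape == X2.shape, y.shape == y2.shape)  # True True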

Let’s look at the structure of the data using the diabetes dataset as an example:

Python 3.10.4
import numpy as np
from sklearn import datasets

X, y = datasets.load_diabetes(return_X_y=True)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
  • Line 5: The expression X.shape returns a tuple representing the dimensions of the NumPy array X.

    • X.shape[0] gives the number of samples (or observations) in the dataset. This represents the number of rows.

    • X.shape[1] gives the number of features (or characteristics) for each sample. This represents the number of columns.

In the output, we see the total number of data points available and the number of input variables used to describe each data point.

The list of available toy datasets is as follows:

Name            Type
boston          Regression
iris            Classification
diabetes        Regression
digits          Classification
linnerud        Regression (multitarget)
wine            Classification
breast_cancer   Classification

Note: The boston dataset was deprecated and removed in scikit-learn 1.2 due to ethical concerns about one of its features, so load_boston is not available in recent versions.
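
As a quick sanity check, any of these loaders can be called in a loop to report each dataset's size. The sketch below uses loaders that ship with current scikit-learn (boston is omitted for the reason noted above):

from sklearn import datasets

# A few of the toy dataset loaders from the table above
loaders = {
    'iris': datasets.load_iris,
    'diabetes': datasets.load_diabetes,
    'digits': datasets.load_digits,
    'wine': datasets.load_wine,
    'breast_cancer': datasets.load_breast_cancer,
}

for name, loader in loaders.items():
    X, y = loader(return_X_y=True)
    print(f'{name}: {X.shape[0]} samples, {X.shape[1]} features')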

Large datasets

There are also several larger datasets that are fetched from external sources. For example, lfw_people, a popular dataset for face recognition, can be downloaded using the following code:

X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)

Different people have different numbers of face images in the dataset; the min_faces_per_person parameter keeps only those people who have at least that many images. The fetcher accepts other parameters as well.

Note: This dataset has to be downloaded and, therefore, might take a while.

Python 3.10.4
import numpy as np
from sklearn import datasets
X, y = datasets.fetch_lfw_people(return_X_y=True, min_faces_per_person=70)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')
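
When the image dimensions or the person names are needed, the same fetcher can be called without return_X_y to get the Bunch form. The sketch below also passes resize, an optional parameter of fetch_lfw_people that rescales each image; the value 0.4 is an arbitrary choice for illustration:

from sklearn import datasets

# Fetching the Bunch form instead of (X, y) exposes extra metadata
people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(people.images.shape)    # (n_samples, height, width) of the grayscale images
print(people.target_names)    # names of the people kept after filtering
print(people.data.shape)      # the same images flattened into the feature matrix X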

Synthetic datasets

For quick testing of some models, it might be handy to create a synthetic dataset for regression or classification tasks. Regression and classification datasets can be created using calls to make_regression and make_classification, respectively.

Python 3.10.4
import numpy as np
from sklearn import datasets

n_samples = 1000
n_features = 10
n_informative = n_features // 2

# Generate a synthetic regression dataset
X, y = datasets.make_regression(n_samples=n_samples, n_features=n_features,
                                n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}')

n_classes = 3
n_clusters_per_class = 1
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features,
                                    n_classes=n_classes, n_clusters_per_class=n_clusters_per_class,
                                    n_informative=n_informative)
print(f'Number of samples = {X.shape[0]}, Number of features = {X.shape[1]}, Number of classes = {len(np.unique(y))}')

Here is the explanation for the code above:

  • Line 4: A dataset with 1000 rows.
  • Line 5: Each sample will have 10 input features.
  • Line 6: Half of the features (i.e., 5) will actually influence the target. The remaining features are noise or redundant.
  • Lines 9–10: make_regression creates a synthetic regression dataset.
    • X: a matrix of shape (1000, 10) containing feature values.
    • y: a vector of target values computed from the 5 informative features plus noise.
  • Line 11: Prints the shape of the regression dataset: number of rows (samples) and number of columns (features).
  • Line 13: The classification problem will have three classes.
  • Line 14: Each class will be formed from one cluster in feature space.
  • Lines 15–17: Generate a multiclass classification dataset. This dataset still contains 10 total features, with 5 of them being informative, and it forms 3 distinct classes, each represented by a single cluster.

Note: For more complex scenarios, sklearn offers functions for datasets where each sample has multiple target values. For multilabel classification (e.g., assigning several labels to a single image), we can use make_multilabel_classification. For multi-target regression (e.g., predicting several continuous quantities for each sample), pass the n_targets parameter to make_regression.
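
As a rough sketch of those two variants (the sample counts and label settings below are arbitrary):

from sklearn import datasets

# Multilabel classification: each sample can carry several labels at once
X_ml, Y_ml = datasets.make_multilabel_classification(n_samples=200, n_features=10,
                                                     n_classes=4, n_labels=2)
print(Y_ml.shape)  # (200, 4): one indicator column per possible label

# Multi-target regression: several continuous targets per sample
X_mt, Y_mt = datasets.make_regression(n_samples=200, n_features=10, n_targets=3)
print(Y_mt.shape)  # (200, 3): three target values per sample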

Feature cleaning

Feature cleaning is the essential process of preparing data by handling errors, inconsistencies, and especially missing values. A machine learning model performs poorly or fails outright if fed raw, messy data.

Handling missing data with imputation

Imputation fills in missing values instead of dropping rows or columns. Here, we use sklearn’s SimpleImputer to replace NaN values with the median of each feature.

Python 3.10.4
from sklearn.impute import SimpleImputer
import numpy as np

# Sample data with missing values (NaN)
X_with_nans = np.array([[1, 2], [np.nan, 3], [7, 6], [8, np.nan]])

# Use the median strategy for imputation
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# Fit and transform the data
X_imputed = imputer.fit_transform(X_with_nans)
print("Imputed Data:\n", X_imputed)  # Output: [[1. 2.] [7. 3.] [7. 6.] [8. 3.]]
  • Line 1: Import SimpleImputer from sklearn. This tool automatically replaces missing values with a chosen statistic (mean, median, most frequent, or constant).

  • Line 2: Import NumPy for creating and handling numerical arrays.

  • Line 5: Create a small feature matrix containing NaN values. Each row represents a sample, each column a feature.

  • Line 8: Create a SimpleImputer object.

    • missing_values=np.nan tells it what counts as missing.

    • strategy='median' means each NaN will be replaced with the median of its column.

  • Line 11: fit_transform first learns the median from the non-missing values (fit), then applies the replacement to fill NaNs (transform).

  • Line 12: Print the imputed dataset. All NaN values are replaced with the median of their respective columns.
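
In a real project, the imputer is typically fitted on the training split only and then reused to transform unseen data, so no information from the test set leaks into the learned statistics. A minimal sketch with made-up arrays:

from sklearn.impute import SimpleImputer
import numpy as np

# Hypothetical training and test matrices with missing entries
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 5.0], [4.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X_train)                # learn the column medians from the training data only
print(imputer.statistics_)          # [4. 3.]
print(imputer.transform(X_test))    # [[4. 5.] [4. 3.]]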

After cleaning, features often need to be mathematically transformed to improve their structure for the learning algorithm. ...