Forms of Supervised Learning

Explore the two core forms of supervised learning: classification and regression. Understand how categorical and numeric labels influence model creation and follow the typical workflow from data preparation to model training, using practical examples from datasets. Gain foundational knowledge applicable to various machine learning algorithms in data science.

The two forms of supervised learning

Conceptually, supervised learning is divided into two forms based on the nature of the label used to produce the machine learning model: classification and regression.

Classification

When the label data is categorical, the supervised learning process produces a classification machine learning model.

Classification is arguably the most common scenario where data scientists apply machine learning. Classification problems exist across all types of organizations. Here are some examples:

  • Fraud detection

  • Churn prevention

  • Predicting patient admittance at a hospital

  • Nonprofit donor conversion

Given the importance of classification in data science, this course uses classification to teach machine learning fundamentals, working with the Adult Census Income and Titanic datasets.

Regression

When the label data is numeric, the supervised learning process produces a regression machine learning model.

Regression problems have a long history across all types of organizations. Here are some examples:

  • Sales forecasting

  • Length of patient hospital stays

  • Customer lifetime value

  • Marketing mix

Later, the course covers constructing regression machine learning models. All machine learning fundamentals learned in the context of classification apply equally to regression.

The terminology used in this course

Data science practitioners commonly use the terms “classification” and “regression.” Note that this split is a conceptual framework and is not technically rigorous.

For example, a statistician would correctly point out that regression is a family of predictive analytics techniques that can be used with categorical (e.g., logistic regression) and numeric (e.g., linear regression) labels.

Despite the lack of rigor of the above terminology, this course uses the language common in the data science community.
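The statistician's point can be seen directly in R. The following is a minimal sketch, assuming only base R and the built-in mtcars data as a stand-in (it is not one of the course datasets). The column name transmission and the variable names are chosen here for illustration.

```r
# Built-in mtcars data is used only as a convenient stand-in.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Logistic regression: a "regression" technique applied to a categorical label.
logit_model <- glm(transmission ~ wt, data = cars, family = binomial)

# Predicted probability of the second factor level ("manual") for each car.
manual_prob <- predict(logit_model, type = "response")

# Linear regression: the same family of techniques applied to a numeric label.
linear_model <- lm(mpg ~ wt, data = cars)
mpg_pred <- predict(linear_model)
```

In the course's vocabulary, the first model would be called a classification model and the second a regression model, even though both come from the regression family.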

Supervised learning workflow

Supervised learning follows the same high-level workflow, whether the problem is classification or regression. The following image visually depicts the supervised learning workflow:

The supervised learning workflow

This workflow uses a sample from the Adult Census Income dataset. The following expands upon each step of the workflow.

Selecting and preparing data

Data is the raw material for supervised learning. High-quality data is critical for producing valuable machine learning models. Data scientists spend most of their time acquiring, understanding, cleaning, and transforming data.

The nature of the label in the training data determines the output of the supervised learning workflow. The workflow produces a classification machine learning model when the training data label is categorical, and it produces a regression machine learning model when the training data label is numeric.
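The rule above can be expressed as a small check on the label column. This is a minimal sketch using base R only. The helper model_type_for_label() is hypothetical, and the rows in toy are made up for illustration. They are not real records from the Adult Census Income dataset.

```r
# Hypothetical helper: name the model type implied by a label column.
model_type_for_label <- function(label) {
  if (is.factor(label) || is.character(label) || is.logical(label)) {
    "classification"
  } else if (is.numeric(label)) {
    "regression"
  } else {
    stop("Unsupported label type: ", class(label)[1])
  }
}

# Made-up rows for illustration only.
toy <- data.frame(
  age = c(39, 50, 28, 45),
  hours_per_week = c(40, 13, 40, 60),
  income = factor(c("<=50K", "<=50K", ">50K", ">50K"))
)

model_type_for_label(toy$income)          # categorical label
model_type_for_label(toy$hours_per_week)  # numeric label
```

One caveat with this sketch: a categorical label stored as numbers (e.g., 0 and 1) would be treated as numeric, so such a column would typically be converted with factor() first.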

The set of data that is used at the start of the workflow is the training data. This is the data used to construct the machine learning model. Splitting data into training and test sets is covered later in the course.

Algorithm

Data scientists select which machine learning algorithm (e.g., random forest) will be applied to the training data to produce the model.

Machine

The machine is typically a laptop or a workstation. However, it is increasingly common for data scientists to leverage cloud-based resources like virtual machines (VMs) and machine learning platform as a service (PaaS) offerings from vendors like Microsoft and Amazon.

Training

The data, algorithm, and machine are the inputs to the training of the machine learning model. Think of training as the steps and configuration the data scientist specifies for constructing the model.

The training regimen specified for the model strongly influences the resulting machine learning model.

While model training is a general concept applicable to all algorithms, the training details (e.g., configuration) are algorithm specific. Later, the course covers training decision trees, random forests, and XGBoost.
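To illustrate how configuration shapes the result, the following sketch trains the same algorithm on the same data twice, changing only one setting. It assumes the rpart package is available (it is a recommended package bundled with most R installations) and uses the built-in iris data as a stand-in. The source does not say which decision tree package the course uses, so treat the choice of rpart as an assumption. The helper count_leaves() is hypothetical.

```r
library(rpart)  # assumed available; not confirmed as the course's package

# Same data and same algorithm, two different training configurations.
shallow_tree <- rpart(Species ~ ., data = iris, method = "class",
                      control = rpart.control(maxdepth = 1))
deeper_tree <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(maxdepth = 3))

# Hypothetical helper: count terminal nodes (leaves) in a fitted tree.
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

count_leaves(shallow_tree)
count_leaves(deeper_tree)
```

Limiting the depth to one split yields a smaller tree than allowing three levels, so the two models can make different predictions even though the data and algorithm are identical.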

Model

The final result of the workflow is the machine learning model. A machine learning model is a programming object in R, like a data frame or a vector.

Machine learning model objects can be passed to functions (e.g., to make predictions) and, once the model has been trained, saved to disk (e.g., as .RData files).
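The following sketch shows a model being treated as an ordinary R object: inspected, used for prediction, saved, and reloaded. It assumes base R only, with a linear model on the built-in mtcars data standing in for the course's models. The file name mpg_model.RData and the new_cars values are made up for illustration.

```r
# Train a small model on built-in data (a stand-in for the course datasets).
model <- lm(mpg ~ wt + hp, data = mtcars)

# The model is an ordinary R object with a class, like a data frame or vector.
class(model)

# Use the model with a function: predict() on new, made-up observations.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
preds_before <- predict(model, newdata = new_cars)

# Save the trained model to disk as an .RData file in a temporary directory.
model_path <- file.path(tempdir(), "mpg_model.RData")
save(model, file = model_path)

# Remove the model from memory, then restore it from disk.
rm(model)
load(model_path)  # recreates the object under its original name, "model"

# The reloaded model should produce the same predictions.
preds_after <- predict(model, newdata = new_cars)
all.equal(preds_before, preds_after)
```

Saving the trained object means the model can be reused later for predictions without repeating the training step.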