Forms of Supervised Learning
Explore the two core forms of supervised learning: classification and regression. Understand how categorical and numeric labels influence model creation and follow the typical workflow from data preparation to model training, using practical examples from datasets. Gain foundational knowledge applicable to various machine learning algorithms in data science.
The two forms of supervised learning
Conceptually, supervised learning is divided into two forms based on the nature of the label used to produce the machine learning model: classification and regression.
Classification
When the label data is categorical, the supervised learning process produces a classification machine learning model.
Classification is arguably the most common scenario where data scientists apply machine learning. Classification problems exist across all types of organizations. Here are some examples:
Fraud detection
Churn prevention
Predicting patient admittance at a hospital
Nonprofit donor conversion
Given the importance of classification in data science, this course uses classification as the teaching vehicle for machine learning fundamentals, working with the Adult Census Income and Titanic datasets.
Regression
When the label data is numeric, the supervised learning process produces a regression machine learning model.
Regression problems have a long history across all types of organizations. Here are some examples:
Sales forecasting
Length of patient hospital stays
Customer lifetime value
Marketing mix
Later, the course covers constructing regression machine learning models. All the machine learning fundamentals learned in the context of classification apply equally to regression.
The terminology used in this course
Data science practitioners commonly use the terms “classification” and “regression.” Note that the division described above is a conceptual framework and is not technically rigorous.
For example, a statistician would correctly point out that regression is a family of predictive analytics techniques that can be used with categorical (e.g., logistic regression) and numeric (e.g., linear regression) labels.
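The statistician's point can be sketched in a few lines of R. This is a minimal illustration, not the course's own code: it assumes the built-in mtcars dataset as a stand-in for the course datasets, and the choice of columns (am as a categorical label, mpg as a numeric label) is made purely for demonstration.

```r
# mtcars ships with base R; it is used here only as a stand-in dataset.
cars <- mtcars

# Turn the 0/1 transmission column into a categorical (factor) label.
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Logistic regression: a "regression" technique fitted to a categorical label.
logistic_fit <- glm(am ~ wt + hp, data = cars, family = binomial)

# Linear regression: a regression technique fitted to a numeric label.
linear_fit <- lm(mpg ~ wt + hp, data = cars)

# Both calls return model objects; only the nature of the label differs.
print(class(logistic_fit))
print(class(linear_fit))
```

In the data science vocabulary used in this course, the first model would be called a classification model and the second a regression model, even though both come from the regression family.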
Despite the lack of rigor of the above terminology, this course uses the language common in the data science community.
Supervised learning workflow
Supervised learning follows the same high-level workflow, whether the problem is classification or regression. The following image visually depicts the supervised learning workflow:
This workflow uses a sample from the Adult Census Income dataset. The following expands upon each step of the workflow.
Selecting and preparing data
Data is the raw material for supervised learning. High-quality data is critical for producing valuable machine learning models, which is why data scientists spend most of their time acquiring, understanding, cleaning, and transforming data.
The nature of the label in the training data determines the output of the supervised learning workflow. The workflow produces a classification machine learning model when the training data label is categorical, and a regression machine learning model when the label is numeric.
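As a rough sketch of how the label type drives the form of supervised learning, the helper below inspects a label column and reports which kind of model the workflow would produce. The function name learning_form is hypothetical (it is not part of R or of the course materials), and the built-in iris dataset is assumed here only for illustration.

```r
# Hypothetical helper: report the form of supervised learning implied by a label.
learning_form <- function(label) {
  if (is.factor(label) || is.character(label) || is.logical(label)) {
    "classification"   # categorical label
  } else if (is.numeric(label)) {
    "regression"       # numeric label
  } else {
    stop("Unsupported label type: ", class(label)[1])
  }
}

# iris ships with base R: Species is a factor, Sepal.Length is numeric.
print(learning_form(iris$Species))
print(learning_form(iris$Sepal.Length))
```

One caveat with this sketch: categories stored as numbers (e.g., 0 and 1) would be reported as regression, so such columns need converting to factors before training a classification model.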
The set of data that is used at the start of the workflow is the training data. This is the data used to construct the machine learning model. Splitting data into training and test sets is covered later in the course.
Algorithm
Data scientists select which machine learning algorithm (e.g., random forest) will be applied to the training data to produce the model.
Machine
The machine is typically a laptop or a workstation. However, it is increasingly common for data scientists to leverage cloud-based resources like virtual machines (VMs) and machine learning platform as a service (PaaS) offerings from vendors like Microsoft and Amazon.
Training
The data, algorithm, and machine are the inputs to the training of the machine learning model. Think of training as the steps and configuration the data scientist specifies for constructing the model.
The training regimen specified for the model strongly influences the resulting machine learning model.
While model training is a general concept applicable to all algorithms, the training details (e.g., configuration) are algorithm specific. Later, the course covers training decision trees, random forests, and XGBoost.
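To make the idea of algorithm-specific configuration concrete, here is a minimal sketch that trains two decision trees on the same data with different settings. It assumes the rpart package (a recommended package bundled with standard R installations) and the built-in iris dataset; the course may use a different package or dataset, and the maxdepth value is an arbitrary choice for illustration.

```r
library(rpart)

# Same data and same algorithm, but two different training configurations.
# maxdepth limits how many levels of splits the tree may contain.
shallow_tree <- rpart(Species ~ ., data = iris, method = "class",
                      control = rpart.control(maxdepth = 1))

default_tree <- rpart(Species ~ ., data = iris, method = "class")

# Each row of the frame component describes one node of the fitted tree,
# so the row count shows how the configuration changed the model.
print(nrow(shallow_tree$frame))
print(nrow(default_tree$frame))
```

The maxdepth setting is meaningful only for tree-based algorithms, which is the sense in which training details are algorithm specific.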
Model
The final result of the workflow is the machine learning model. A machine learning model is a programming object in R, like a data frame or a vector.
Machine learning model objects can be used with functions (e.g., to make predictions), and can be saved to disk (e.g., as .RData files) after the model has been trained.
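The sketch below illustrates a model behaving as an ordinary R object: it is passed to a function to make predictions, saved to disk as an .RData file, and loaded back. It assumes a linear model on the built-in mtcars dataset as a stand-in for the course models; the new_cars values and the mpg_model.RData file name are hypothetical inputs chosen for demonstration.

```r
# Train a simple model; the result is an ordinary R object.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations to score with the trained model.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predictions <- predict(model, newdata = new_cars)

# Save the trained model to disk as an .RData file in a temporary directory.
model_path <- file.path(tempdir(), "mpg_model.RData")
save(model, file = model_path)

# Load the model into a separate environment to avoid overwriting `model`.
restored <- new.env()
load(model_path, envir = restored)

# The restored model produces the same predictions as the original.
print(all.equal(predict(restored$model, newdata = new_cars), predictions))
```

Loading into a separate environment is a design choice for this sketch; calling load(model_path) directly would restore the object under its original name, model, in the current workspace.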