Preparing the Test Dataset

Learn how to split data into training and test datasets, and then prepare the test dataset for predictions.

We'll cover the following...

Splitting the data
Transforming the data

Splitting the data

The first step of any machine learning project is splitting the data into training and test datasets. The training dataset is used throughout model development, including exploratory data analysis (EDA), feature engineering, training, and tuning. The test dataset is held back until the end of the project, where it serves as the final check of a machine learning model’s prediction quality.

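The rsample package offers the initial_split(), training(), and testing() functions for splitting data. The following code demonstrates them using the Adult Census Income dataset. Here, prop = 0.8 places about 80 percent of the rows in the training dataset, and strata = "income" keeps the distribution of income similar in both datasets: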

#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
#================================================================================================
# Load the Adult Census Income dataset
#
adult_census <- read_csv("adult_census.csv", show_col_types = FALSE)
#================================================================================================
# Split the data into training and test datasets, stratified by income
#
# It is best practice to set the seed for split reproducibility
set.seed(498798)
adult_split <- initial_split(adult_census, prop = 0.8, strata = "income")
# Create the training and test data frames
adult_train <- training(adult_split)
adult_test <- testing(adult_split)
str(adult_train)
str(adult_test)
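
One way to confirm that a stratified split behaved as intended is to compare the class proportions in the two resulting data frames. The sketch below is illustrative only: it assumes the rsample package is installed, and it uses a small synthetic data frame (the names toy, toy_split, toy_train, and toy_test are hypothetical) in place of adult_census.csv so that it runs on its own.

```r
library(rsample)

# Hypothetical toy data standing in for adult_census:
# 1,000 rows, with roughly 25% of rows labeled ">50K"
set.seed(123)
toy <- data.frame(
  age    = sample(18:80, 1000, replace = TRUE),
  income = factor(sample(c("<=50K", ">50K"), 1000,
                         replace = TRUE, prob = c(0.75, 0.25)))
)

# Same pattern as the lesson code: 80/20 split, stratified by income
toy_split <- initial_split(toy, prop = 0.8, strata = income)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row should land in exactly one of the two data frames
nrow(toy_train) + nrow(toy_test) == nrow(toy)

# With stratification, the income proportions should be close in both
prop.table(table(toy_train$income))
prop.table(table(toy_test$income))
```

The same two prop.table() calls could be run on adult_train$income and adult_test$income after the split above. If the proportions differ noticeably, it is worth checking that the strata argument names the intended column.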
