Decision Boundaries
Discover how decision boundaries are critical for engineering the best features for decision trees.
Partitioning training data
The concept of decision boundaries is vital to engineering the best features for the CART algorithm. A decision boundary describes the geometry of how a machine learning model partitions the training data to produce predictions. The best features allow a machine learning algorithm to partition the training data into the most effective decision boundaries.
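Before turning to the Titanic data, a minimal sketch can make the idea concrete. The following example uses R's built-in `iris` data (not the course dataset, which is an assumption swapped in here so the snippet is self-contained) to fit a CART tree with `rpart`; each split the tree learns is an axis-aligned decision boundary that partitions the training data:

```r
library(rpart)

# Fit a CART tree on two numeric features of the built-in iris data.
fit <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Each internal node is an axis-aligned decision boundary, e.g.
# "Petal.Length < 2.45". Printing the fit lists these splits.
print(fit)
```

Note that every split involves a single feature and a single threshold, which is why feature quality matters so much for CART: the algorithm can only carve the training data along boundaries that individual features make available.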
Understanding decision boundaries is most easily accomplished via examples. The following code trains a decision tree to predict Survived from the combination of the Sex, Pclass, and Embarked features. Run the code and examine the tree visualization.
```r
#================================================================================================
# Load libraries - suppress messages
#
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(rattle))

#================================================================================================
# Load the Titanic training data and transform Embarked to a factor
#
titanic_train <- read_csv("titanic_train.csv", show_col_types = FALSE) %>%
  mutate(Sex = factor(Sex),
         Embarked = factor(case_when(Embarked == "C" ~ "Cherbourg",
                                     Embarked == "Q" ~ "Queenstown",
                                     Embarked == "S" ~ "Southampton",
                                     is.na(Embarked) ~ "missing")))

#================================================================================================
# Craft the recipe - recipes package
#
titanic_recipe <- recipe(Survived ~ Sex + Pclass + Embarked, data = titanic_train) %>%
  step_num2factor(Survived,
                  transform = function(x) x + 1,
                  levels = c("perished", "survived")) %>%
  step_num2factor(Pclass,
                  levels = c("first", "second", "third"))

#================================================================================================
# Specify the algorithm - parsnip package
#
# Specify a single CART decision tree with no pre-pruning and a value of 14
# for the min_n hyperparameter
titanic_model <- decision_tree(cost_complexity = 0, min_n = 14) %>%
  set_engine("rpart") %>%
  set_mode("classification")

#================================================================================================
# Set up the workflow
#
titanic_workflow <- workflow() %>%
  add_recipe(titanic_recipe) %>%
  add_model(titanic_model)

#================================================================================================
# Fit the model to all the Titanic training data
#
titanic_fit <- titanic_workflow %>%
  fit(titanic_train)

#================================================================================================
# Visualize the tree by extracting the trained model
#
titanic_tree <- extract_fit_parsnip(titanic_fit)

# Write the visualization to a file on disk
png(filename = "output/tree.png", height = 750, width = 750)
fancyRpartPlot(titanic_tree$fit, sub = NULL)

# Close the device opened by the png() function
dev.off()
```
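Beyond the plot, a fitted tree's decision boundaries can be read directly from the underlying `rpart` object. The sketch below is a self-contained stand-in: it uses R's built-in `Titanic` contingency table, expanded to one row per passenger, rather than `titanic_train.csv` (an assumption made so the example runs without the course data file):

```r
library(rpart)

# Expand the built-in 4-way Titanic table to one row per passenger.
tt <- as.data.frame(datasets::Titanic)
tt <- tt[rep(seq_len(nrow(tt)), tt$Freq), c("Class", "Sex", "Age", "Survived")]

# Fit a CART classification tree, analogous to the workflow above.
fit <- rpart(Survived ~ Class + Sex + Age, data = tt, method = "class")

# fit$frame holds one row per node; `var` names the feature split at each
# internal node ("<leaf>" marks terminal nodes). These splits are the
# decision boundaries the tree learned from the training data.
print(fit$frame$var)
```

Inspecting `fit$frame` this way shows which features the algorithm actually used to partition the data, which is a quick check on whether an engineered feature is earning its place in the model.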
...