Machine Learning
Train single and bagged decision trees as well as a random forest, and evaluate their performance.
Since our focus is machine learning, let's split the data into training and test sets and move on to training the model.
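A minimal sketch of the split, using scikit-learn's `train_test_split`; the Iris dataset stands in for the lesson's actual data, and the 70/30 ratio and seed are illustrative assumptions:

```python
# Sketch only: Iris is a stand-in for the lesson's dataset,
# which isn't shown in this excerpt.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold out 30% of the rows for evaluation (assumed ratio)
    random_state=42,  # fix the seed so the split is reproducible
)
```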
We'll start with training a single decision tree and then compare the results with a random forest.
Single decision tree
Let's train a single decision tree. The default splitting criterion is Gini impurity; here we set it to entropy (information gain is computed from entropy).
Notice that we’re leaving everything as default, other than the criterion.
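A sketch of this step, assuming the `X_train` and `y_train` arrays from the split above:

```python
# Train a single decision tree, keeping every hyperparameter at its
# default except the splitting criterion, which we switch from Gini
# impurity to entropy.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X_train, y_train)
```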
Prediction and evaluation
Evaluation is important because it shows how well the model performs on data it has never seen.
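A sketch of prediction and evaluation, assuming the fitted `tree` from above and scikit-learn's standard metrics:

```python
# Predict on the held-out test set, then summarize per-class precision,
# recall, and F1, plus the confusion matrix to see which classes are
# being mislabeled.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```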
With a single decision tree, we can see that the model mislabels some test examples. We also know that decision trees can overfit very easily, limiting generalization and leading to poor performance on unseen data.
Bagged decision trees
We learned about bagging (bootstrap aggregation) as a general-purpose procedure for reducing the high variance of decision trees. If we opt for bagged decision trees, we therefore expect them to perform better than a single decision tree. However, because the bagged trees are structurally similar, their predictions remain strongly correlated, and the random forest method is generally preferred over both a single tree and bagged trees. Let's try bagged trees first and then move on to the random forest for comparison.
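A sketch of the bagging setup under the assumptions described next (five base trees, feature bootstrapping on, and a fixed seed for repeatability):

```python
# Bag five decision trees. bootstrap_features=True samples the feature
# columns with replacement for each tree, in addition to the default
# bootstrap sampling of rows.
from sklearn.ensemble import BaggingClassifier

bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(criterion="entropy"),
    # note: in scikit-learn < 1.2 this parameter is named base_estimator
    n_estimators=5,           # five bagged trees
    bootstrap_features=True,  # also bootstrap the feature columns
    random_state=42,          # assumed seed, for repeatable results
)
bagged.fit(X_train, y_train)
```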
We have trained five bagged trees, and the final prediction for any test example comes from a majority vote over these bagged trees (the base estimators). Since we set the module to bootstrap features (columns), let's see which feature columns the first two bagged trees were trained on. Please note that changing the random_state ...
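One way to inspect this is through the fitted model's `estimators_features_` attribute, which stores the sampled feature indices per base estimator; this is a sketch, and the actual indices depend on the data and the chosen seed:

```python
# Show which feature columns each of the first two bagged trees saw
# during training (indices into the original feature array).
for i, feat_idx in enumerate(bagged.estimators_features_[:2]):
    print(f"Tree {i}: feature indices {feat_idx}")
```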