Machine Learning
Train single and bagged decision trees as well as a random forest, and evaluate their performance.
Since our focus is machine learning, let's split the data into training and test sets and move on to training the model.
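A minimal sketch of the split, using scikit-learn's `train_test_split`; the Iris dataset stands in for the lesson's actual data, and the 70/30 ratio and seed are illustrative assumptions:

```python
# Sketch only: Iris is a stand-in for the lesson's dataset,
# which isn't shown in this excerpt.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold out 30% of the rows for evaluation (assumed ratio)
    random_state=42,  # fix the seed so the split is reproducible
)
```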
We'll start with training a single decision tree and then compare the results with a random forest.
Single decision tree
Let's train a single decision tree. The default splitting criterion is Gini impurity; here we set it to entropy (information gain is computed from entropy).
Notice that we’re leaving everything as default, other than the criterion.
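A sketch of this step, assuming the `X_train` and `y_train` arrays from the split above:

```python
# Train a single decision tree, keeping every hyperparameter at its
# default except the splitting criterion, which we switch from Gini
# impurity to entropy.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X_train, y_train)
```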
Prediction and evaluation
Evaluation is important because it shows how well the model performs on data it has never seen.
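A sketch of prediction and evaluation, assuming the fitted `tree` from above and scikit-learn's standard metrics:

```python
# Predict on the held-out test set, then summarize per-class precision,
# recall, and F1, plus the confusion matrix to see which classes are
# being mislabeled.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = tree.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```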
With a single decision tree, we can see that the model mislabels some test examples. We also know that decision trees can overfit very easily, limiting generalization and leading to poor performance on unseen data.
Bagged decision trees
We learned about bagging (bootstrap aggregation) as a general-purpose procedure for reducing the high variance of decision trees. If we opt for bagged decision trees, we therefore expect them to perform better than a single decision tree. However, because the bagged trees are structurally similar, their predictions remain strongly correlated, and the random forest method is generally preferred over both a single tree and bagged trees. Let's try bagged trees first and then move on to the random forest for comparison.
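A sketch of the bagging setup under the assumptions described next (five base trees, feature bootstrapping on, and a fixed seed for repeatability):

```python
# Bag five decision trees. bootstrap_features=True samples the feature
# columns with replacement for each tree, in addition to the default
# bootstrap sampling of rows.
from sklearn.ensemble import BaggingClassifier

bagged = BaggingClassifier(
    estimator=DecisionTreeClassifier(criterion="entropy"),
    # note: in scikit-learn < 1.2 this parameter is named base_estimator
    n_estimators=5,           # five bagged trees
    bootstrap_features=True,  # also bootstrap the feature columns
    random_state=42,          # assumed seed, for repeatable results
)
bagged.fit(X_train, y_train)
```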
We have trained five bagged trees, and the final prediction for any test example comes from a majority vote over these bagged trees (the base estimators). Since we set the module to bootstrap features (columns), let's see which feature columns the first two bagged trees were trained on. Please note that changing the random_state ...
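One way to inspect this is through the fitted model's `estimators_features_` attribute, which stores the sampled feature indices per base estimator; this is a sketch, and the actual indices depend on the data and the chosen seed:

```python
# Show which feature columns each of the first two bagged trees saw
# during training (indices into the original feature array).
for i, feat_idx in enumerate(bagged.estimators_features_[:2]):
    print(f"Tree {i}: feature indices {feat_idx}")
```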