2- Random Forests

This lesson will provide an overview of random forests and go over the steps involved in their implementation.

Introduction to random forests

While decision trees are useful for explaining a model’s decision structure, they are also prone to overfitting.

In general, decision trees capture patterns in the training data accurately, but because they commit to a single fixed sequence of decision paths, variance in the test data or in any new data can lead to poor predictions. Relying on a single tree design also limits the method's flexibility to handle variance and future outliers.

A solution to mitigate overfitting is to grow multiple trees using a different technique called random forests (RF). This method grows multiple decision trees, each on a randomized selection of the input data, and combines their results by averaging the outputs for regression or by class voting for classification.

The variables considered when splitting the data are also randomized and capped. If every tree in the forest could inspect the full set of variables, the trees would look similar, because each one would select the optimal variable at every split to maximize information gain at the next layer.

Unlike a standard decision tree, which has the full set of variables to draw from, the random forest algorithm works with an artificially limited set of variables when building each decision. With fewer variables on offer and randomized data for each tree, random forests are less likely to generate a collection of near-identical trees. By embracing randomness and volume, random forests can provide a reliable result with potentially less variance and overfitting than a single decision tree.
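To make this concrete, the sketch below grows a small forest by hand: each tree is trained on a bootstrap sample of the rows and considers only a capped, random subset of variables at each split, and the forest predicts by majority vote. The synthetic dataset, the tree count, and the other parameter values are illustrative assumptions rather than part of this lesson; Scikit-learn's RandomForestClassifier bundles the same steps into a single estimator.

# Minimal sketch of the random forest idea, assuming a synthetic dataset
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

trees = []
for _ in range(25):                                  # grow 25 trees
    rows = rng.integers(0, len(X), len(X))           # bootstrap sample of the rows
    tree = DecisionTreeClassifier(
        max_features="sqrt",                         # cap the variables considered at each split
        random_state=int(rng.integers(1_000_000)),
    )
    trees.append(tree.fit(X[rows], y[rows]))

# Combine the trees by majority vote; predicting on X here simply shows the
# voting step (labels are 0/1, so the mean across trees acts as a vote count)
votes = np.stack([t.predict(X) for t in trees])
forest_prediction = (votes.mean(axis=0) >= 0.5).astype(int)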

The following steps are involved in the implementation of random forests (a sketch of the full sequence follows the list):

  • 1 - Import libraries
  • 2 - Import dataset
  • 3 - Convert non-numeric variables
  • 4 - Remove columns
  • 5 - Set X and y variables
  • 6 - Set algorithm
  • 7 - Evaluate

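As a point of reference, here is a hedged sketch of those seven steps using Scikit-learn. The file name advertising.csv, the column names, and the parameter values are assumptions made for illustration; adjust them to match the dataset used in the previous lesson.

# 1 - Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 2 - Import dataset (file name assumed)
df = pd.read_csv('advertising.csv')

# 3 - Convert non-numeric variables into dummy (one-hot) columns;
#     'City' and 'Country' are assumed categorical columns
categorical_cols = [c for c in ['City', 'Country'] if c in df.columns]
df = pd.get_dummies(df, columns=categorical_cols)

# 4 - Remove columns unlikely to help the model (names assumed)
df = df.drop(columns=['Ad Topic Line', 'Timestamp'], errors='ignore')

# 5 - Set X and y variables ('Clicked on Ad' is an assumed dependent variable)
X = df.drop(columns=['Clicked on Ad'])
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# 6 - Set algorithm: 100 trees, each split capped to a random subset of variables
model = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                               random_state=10)
model.fit(X_train, y_train)

# 7 - Evaluate on the held-out test data
print(classification_report(y_test, model.predict(X_test)))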
Exercise

This exercise repeats the previous one, using the Advertising dataset and the same dependent and independent variables, but builds the model with RandomForestClassifier from Scikit-learn.
