What is a pruning algorithm?

Introduction

In data mining and machine learning, decision-tree classification repeatedly splits the data set into smaller subsets until the tree consists only of decision nodes and leaf nodes. Pruning optimizes the resulting tree by removing nodes that contribute little to classification, which makes the decision tree smaller and less prone to overfitting the training data.

Decision-tree pruning should not be confused with alpha-beta pruning. Alpha-beta pruning is a different technique: it skips branches of a game-search tree during minimax search, rather than simplifying a learned classifier.

There are multiple ways to prune a decision tree, including:

Pruning by information gain
Pruning by classification performance on the validation set

Pruning by information gain uses only information that is already available once the tree has been built from the training data, namely the information gain of each split.
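To make the quantity concrete, here is a minimal sketch of how the information gain of a single split could be computed from class counts. It assumes Shannon entropy as the impurity measure; the function names `entropy` and `information_gain` and the example counts are illustrative, not taken from any particular library.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical split: 10 samples (5 per class) separated perfectly into two children.
perfect = information_gain([5, 5], [[5, 0], [0, 5]])

# Hypothetical split that leaves the class mix unchanged in both children.
useless = information_gain([5, 5], [[3, 3], [2, 2]])
```

A split like `useless`, whose gain is close to zero, is the kind of node this pruning method targets first.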

Pruning by classification performance on the validation set uses a separate validation dataset: the decision tree is pruned to whichever form classifies the validation data best.
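One common way to realize this idea is reduced-error pruning, sketched below under some assumptions: the tree is binary, each node stores the majority class of its training samples in `label`, and a subtree is collapsed whenever doing so does not lower validation accuracy. The `Node` class and the function names are hypothetical, written only for this illustration.

```python
class Node:
    def __init__(self, label, feature=None, threshold=None, left=None, right=None):
        self.label = label          # majority class of the training samples at this node
        self.feature = feature      # index of the feature tested here; None for a leaf
        self.threshold = threshold  # go left if x[feature] <= threshold, else right
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.feature is None

def predict(node, x):
    """Walk from the given node down to a leaf and return that leaf's label."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

def accuracy(root, X, y):
    """Fraction of samples in X whose predicted label matches y."""
    return sum(predict(root, x) == t for x, t in zip(X, y)) / len(y)

def prune_by_validation(root, node, X_val, y_val):
    """Bottom-up: turn a node into a leaf unless that lowers validation accuracy."""
    if node.is_leaf():
        return
    prune_by_validation(root, node.left, X_val, y_val)
    prune_by_validation(root, node.right, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = (node.feature, node.left, node.right)
    node.feature, node.left, node.right = None, None, None  # tentatively make it a leaf
    if accuracy(root, X_val, y_val) < before:
        node.feature, node.left, node.right = saved          # undo: the subtree was helping
```

Calling `prune_by_validation(root, root, X_val, y_val)` prunes the whole tree in place. Ties are resolved in favor of the smaller tree here; that is a design choice of this sketch, not a requirement of the method.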

Pruning by information gain

The algorithm is as follows:

Catalog all twigs, that is, nodes whose children are all leaves.
Keep a total count of all the leaves in the tree.
Set a threshold for the maximum number of leaves the tree should have.
Loop while the number of leaves in the tree exceeds the threshold, repeating the remaining steps:
Find the twig that gives the least information gain.
Take that twig and remove its children.
We remove the children because the split at this node gains too little information, so the split can be treated as irrelevant.
Relabel the twig as a leaf.
Update the leaf count and the catalog of twigs, since the parent of the new leaf may now be a twig itself.
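The steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes each node already stores the information gain of its split in a `gain` attribute and its majority class in `label`, and it recomputes the list of twigs on every pass for simplicity. The `Node` class and all function names are hypothetical.

```python
class Node:
    def __init__(self, label, gain=0.0, children=None):
        self.label = label              # majority class, used once the node becomes a leaf
        self.gain = gain                # information gain of this node's split (assumed precomputed)
        self.children = children or []  # an empty list means the node is a leaf

    def is_leaf(self):
        return not self.children

def iter_nodes(root):
    """Yield every node in the tree using an explicit stack."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)

def count_leaves(root):
    return sum(1 for n in iter_nodes(root) if n.is_leaf())

def find_twigs(root):
    """Twigs are internal nodes whose children are all leaves."""
    return [n for n in iter_nodes(root)
            if not n.is_leaf() and all(c.is_leaf() for c in n.children)]

def prune_by_information_gain(root, max_leaves):
    """Collapse the lowest-gain twig until at most max_leaves leaves remain."""
    leaves = count_leaves(root)
    while leaves > max_leaves:
        twigs = find_twigs(root)
        if not twigs:                         # the tree is a single leaf; nothing left to prune
            break
        weakest = min(twigs, key=lambda n: n.gain)
        leaves -= len(weakest.children) - 1   # children disappear, the twig becomes one leaf
        weakest.children = []                 # the twig now acts as a leaf with its own label
    return leaves
```

Rescanning the tree for twigs on each iteration keeps the sketch short. For large trees, a priority queue of twigs keyed by gain would avoid the repeated scans.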
