Decision Trees

This lesson will focus on training decision tree models in Python.

Decision trees

In the first lesson of this chapter, we saw that linear regression models capture only linear relationships between the dependent and independent variables; they fail to capture nonlinear relationships. Decision trees are designed to capture such nonlinear relationships.

Decision trees model data as a hierarchy of branches. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute (e.g., whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (the decision reached after evaluating all attributes along the path). The paths from the root to the leaves represent classification rules. Decision trees can be adapted to both regression and classification tasks.
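
To make this concrete, here is a minimal sketch of both modes using scikit-learn (the toy feature values and targets below are invented purely for illustration):

```python
# Minimal sketch: the same tree API supports classification and regression.
# The data here is invented toy data, not from the lesson.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [4.0]]  # one feature, four samples

# Classification: each leaf predicts a class label.
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, ["no", "no", "yes", "yes"])
print(clf.predict([[2.7]]))  # predicted label for a new sample

# Regression: each leaf predicts a continuous value.
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X, [1.1, 1.9, 3.2, 4.1])
print(reg.predict([[2.7]]))  # predicted value for a new sample
```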

Common terms used with decision trees:

  • Root node: Represents the entire population or sample; it is further divided into two or more homogeneous sets.

  • Splitting: The process of dividing a node into two or more sub-nodes.

  • Decision node: A sub-node that splits into further sub-nodes.

  • Leaf/Terminal node: A node that does not split any further.

  • Pruning: Removing sub-nodes of a decision node; it is the opposite of splitting (see the sketch after this list).

  • Branch/Sub-tree: A subsection of the entire tree.

  • Parent and Child node: A node that is divided into sub-nodes is the parent of those sub-nodes; the sub-nodes are its children.
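
To ground these terms, the following sketch (an assumed example using scikit-learn's built-in iris dataset, not part of the original lesson) grows a full tree and then prunes it with cost-complexity pruning, showing how pruning removes sub-nodes:

```python
# Assumed example: contrasting splitting and pruning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fully grown tree: repeated splitting creates many decision and leaf nodes.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("nodes before pruning:", full_tree.tree_.node_count)

# Pruning: a positive ccp_alpha removes sub-nodes, the opposite of splitting.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print("nodes after pruning:", pruned_tree.tree_.node_count)
```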

A very common example given in the context of decision trees is classifying a person as fit or unfit based on their age, whether they eat pizza, and whether they exercise in the morning. A decision tree for this problem could look like the one sketched below.
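
Since the original figure is not reproduced here, the sketch below builds a comparable tree from a small hand-made dataset (the ages, habits, and labels are invented for illustration) and prints the resulting flowchart:

```python
# Hand-made toy data mirroring the fit/unfit example; all values are invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features per person: [age, eats_pizza (0/1), exercises_in_morning (0/1)]
X = [
    [25, 1, 0],
    [30, 1, 1],
    [45, 0, 1],
    [50, 1, 0],
    [35, 0, 0],
    [28, 0, 1],
]
y = ["unfit", "fit", "fit", "unfit", "unfit", "fit"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned flowchart: the root test at the top, class labels at the leaves.
print(export_text(tree, feature_names=["age", "eats_pizza", "exercises"]))
```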
