Decision Trees
Explore how decision trees function as intuitive machine learning models for classification and regression. Understand impurity measures like Gini impurity, learn the tree-building process through recursive splitting, and implement decision trees both from scratch with numpy and with sklearn. Gain practical insight into their role in ensemble learning and how they optimize predictive accuracy.
Decision trees are powerful machine learning algorithms that are widely used for both classification and regression tasks. They’re simple to understand and interpret, making them valuable tools for data analysis and decision-making. Beyond their standalone usage, they often serve as the default choice of weak learner in ensemble learning. In this lesson, we’ll explore the concept of decision trees, examine how they work, and implement a basic decision tree from scratch using numpy in Python. We’ll also present its implementation using sklearn.
How do decision trees work?
Decision trees are tree-like structures where each internal node represents a decision based on a specific feature, and each leaf node represents the outcome (class label for classification or predicted value for regression). They work by recursively splitting the dataset into subsets based on the most significant attribute (feature) at each step, with the goal of creating leaf nodes that represent the final predicted class or value.
For example, a decision tree makes predictions by answering a sequence of simple questions like:
- “Is Age ≤ 30?”
- “Is Occupation = Engineer?”
At each step, the data is split based on a feature that best separates the classes. This process continues until the subsets become pure, meaning all the samples in a node belong to the same class (for classification) or have very similar values (for regression). A pure subset tells the tree that no further splitting is needed because the data is already perfectly separated. When this happens, the node becomes a leaf node, which gives the final prediction.
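To make this concrete, here is a minimal sketch of how such a sequence of questions can be expressed as nested conditionals in Python. The split thresholds and class labels (such as “Buys product”) are purely illustrative assumptions, not part of any real dataset; a real decision tree learns these splits from data rather than having them hard-coded.

```python
def predict(sample):
    """A hand-written two-level 'decision tree' for illustration only.

    Each `if` plays the role of an internal node (a question about a feature),
    and each `return` plays the role of a leaf node (the final class label).
    """
    if sample["Age"] <= 30:                     # first split: "Is Age <= 30?"
        if sample["Occupation"] == "Engineer":  # second split: "Is Occupation = Engineer?"
            return "Buys product"
        return "Does not buy"
    return "Does not buy"


print(predict({"Age": 25, "Occupation": "Engineer"}))  # -> "Buys product"
print(predict({"Age": 45, "Occupation": "Teacher"}))   # -> "Does not buy"
```

A learned tree has exactly this shape; the difference is that the algorithm chooses which feature to ask about and where to split it at every node.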
Gini impurity
One of the key concepts in decision trees is the calculation of impurity to determine how heterogeneous (mixed) a dataset is. One common impurity measure is the Gini impurity, which measures the probability of incorrectly classifying a randomly chosen element from the dataset. Lower Gini impurity values indicate purer subsets. The formula for Gini impurity is given below:
$$G = 1 - \sum_{i=1}^{C} p_i^2$$

Here, $G$ is the Gini impurity, $C$ is the number of classes in the dataset, and $p_i$ is the probability of an element in the given dataset belonging to class $i$.
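As a small sketch (assuming the class labels are stored in a list or numpy array), the formula can be computed directly with numpy:

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    _, counts = np.unique(labels, return_counts=True)  # samples per class
    probabilities = counts / counts.sum()               # p_i for each class i
    return 1.0 - np.sum(probabilities ** 2)
```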
For binary classification, its value ranges from 0 (perfectly pure) to 0.5 (maximum impurity). For example, in a “pure” scenario where a fruit basket contains only apples [Apple, Apple, Apple, Apple], the probability of selecting an apple ($p_{apple}$) is $1$, while the probability of selecting an orange ($p_{orange}$) is $0$. Plugging these values into the Gini formula, we calculate $G = 1 - (1^2 + 0^2) = 0$, resulting in a Gini impurity of 0. On the other hand, in a “mixed” scenario where the basket contains two apples and two oranges [Apple, Apple, Orange, Orange], the probability of selecting an apple ($p_{apple}$) is $0.5$ and the probability of selecting an orange ($p_{orange}$) is also $0.5$. Applying the Gini formula, $G = 1 - (0.5^2 + 0.5^2) = 0.5$, we arrive at a Gini impurity of 0.5, which is the maximum possible impurity for a two-class system, indicating that the node is a complete “toss-up” with no dominant class.
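Running the same computation on both baskets confirms these values; the snippet below repeats the `gini_impurity` helper from the previous sketch so it can be run on its own:

```python
import numpy as np


def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return 1.0 - np.sum(probabilities ** 2)


pure_basket = ["Apple", "Apple", "Apple", "Apple"]
mixed_basket = ["Apple", "Apple", "Orange", "Orange"]

print(gini_impurity(pure_basket))   # 0.0 -> perfectly pure node
print(gini_impurity(mixed_basket))  # 0.5 -> maximum impurity for two classes
```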