How to build a decision tree with the IRIS dataset in Python
A decision tree is a supervised machine learning algorithm that uses a tree-like model of decisions and their consequences to arrive at a prediction. The data is repeatedly split according to the value of a feature until a final decision is reached at a leaf.
Usually, a decision tree is drawn upside down, with the root node at the top and the leaf nodes at the bottom. A decision tree usually contains 3 types of nodes.
- Root node: The very top node that represents the entire population or sample.
- Decision nodes: Sub-nodes below the root that split further into other sub-nodes.
- Leaf nodes: Nodes with no children, also known as terminal nodes.
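To make the three node types concrete, here is a minimal hand-built sketch in plain Python. The Node class, the predict helper, and the thresholds are hypothetical and chosen only for illustration; they are not learned from data and are not how scikit-learn stores its trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a hand-built binary decision tree (hypothetical structure)."""
    feature: Optional[str] = None      # feature tested at this node (None for a leaf)
    threshold: Optional[float] = None  # go left if the value is <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None        # prediction stored in a leaf

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def predict(node: Node, sample: dict) -> str:
    """Walk from the root down to a leaf and return that leaf's label."""
    while not node.is_leaf():
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label


# Root node -> one decision node -> three leaf nodes.
# The thresholds are illustrative values, not learned from data.
root = Node(
    feature="petal_length", threshold=2.5,
    left=Node(label="setosa"),
    right=Node(
        feature="petal_width", threshold=1.7,
        left=Node(label="versicolor"),
        right=Node(label="virginica"),
    ),
)

print(predict(root, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(root, {"petal_length": 5.8, "petal_width": 2.2}))  # virginica
```

Every prediction starts at the root, passes through zero or more decision nodes, and ends at exactly one leaf.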
Description
In supervised machine learning, problems fall into two main types:
- Regression
- Classification
You can use decision trees for both regression and classification problems.
- Regression tree: Used to predict continuous variables. For example, predicting rainfall in a region or the revenue that a company might generate in the future.
- Classification tree: Used to predict discrete class labels. For example, classifying whether the temperature of a day will be high or low, or predicting whether a team will win a match.
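The difference between the two tree types can be sketched with scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor. The toy data below (hours of sunshine, temperature labels, and temperature values) is made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data (made up): one feature, hours of sunshine per day.
X = [[1.0], [2.0], [8.0], [9.0]]

# Classification tree: the target is a discrete label.
y_class = ["low", "low", "high", "high"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
print(clf.predict([[1.5], [8.5]]))

# Regression tree: the target is a continuous value (temperature in degrees Celsius).
y_reg = [12.0, 14.0, 27.0, 29.0]
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)
print(reg.predict([[2.0], [9.0]]))
```

The classifier returns one of the labels it saw during training, while the regressor returns a number.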
How decision trees work
Decision trees work step by step: starting at the root, each node splits its data into subsets using one feature, and the process repeats on each subset until a stopping condition is met. The feature and split point at each node are chosen according to a defined criterion. The main criteria used for splitting are:
- Gini impurity: Measures how mixed the classes in a node are. A value of 0 means that every sample in the node belongs to the same class.
- Entropy: Measures the randomness (disorder) of the class labels in a node.
- Variance: Normally used for regression trees. It measures how far the target values in a node vary from their mean.
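The three criteria above can be sketched in a few lines of plain Python. This is a minimal illustration using the standard formulas (Gini as 1 minus the sum of squared class proportions, entropy in bits, population variance); the function names are hypothetical, and scikit-learn's internal implementation differs.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


pure = ["setosa"] * 4
mixed = ["setosa", "setosa", "virginica", "virginica"]

print(gini(pure), gini(mixed))          # 0.0 0.5
print(entropy(pure), entropy(mixed))    # 0.0 1.0
print(variance([2.0, 4.0, 6.0, 8.0]))   # 5.0
```

A pure node scores 0 on both classification criteria, and an evenly mixed two-class node scores the maximum (0.5 for Gini, 1 bit for entropy). A good split is one that lowers these values in the child nodes.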
Practical implementation
Let’s use a real-world dataset to apply decision tree algorithms in Python. You can follow the steps below to create a feasible and useful decision tree:
Import the libraries
We import the required libraries for the model, including load_iris from sklearn.datasets and accuracy_score from sklearn.metrics.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
Gather the data
We will be using the IRIS dataset to build a decision tree classifier. The dataset contains information for three classes of the IRIS plant, namely IRIS Setosa, IRIS Versicolour, and IRIS Virginica, with the following attributes: sepal length, sepal width, petal length, and petal width.
data = load_iris()

# Extracting Attributes / Features
X = data.data

# Extracting Target / Class Labels
y = data.target
Split the data into training and test sets
Import the train_test_split function and use it to split the dataset into training and test data.
# Import Library for splitting data
from sklearn.model_selection import train_test_split

# Creating Train and Test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.25)
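As a quick sanity check, the shapes of the resulting arrays show how the 150 samples are divided. The snippet below is a self-contained sketch that repeats the loading step so it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=50, test_size=0.25
)

print(X.shape)        # (150, 4)
print(X_train.shape)  # (112, 4)
print(X_test.shape)   # (38, 4)
```

With test_size=0.25, scikit-learn rounds the test set up to 38 samples and leaves 112 for training.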
Create the model in Python
Import DecisionTreeClassifier to perform the classification.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
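The classifier above uses the default settings. As a sketch of how the splitting criterion and tree size can be changed, the example below passes a few constructor parameters; the specific values (entropy, a depth of 3, a seed of 0) are illustrative assumptions, not recommendations from this article.

```python
from sklearn.tree import DecisionTreeClassifier

# Default settings: the splitting criterion is Gini impurity.
default_clf = DecisionTreeClassifier()
print(default_clf.criterion)  # gini

# Illustrative alternative: split on entropy, cap the depth, fix the seed.
tuned_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
print(tuned_clf.get_params()["max_depth"])  # 3
```

Limiting max_depth is one common way to keep a tree from memorizing the training data.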
Use the test dataset to make a prediction
Our aim is to predict the class of the IRIS plant based on the given attributes.
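A minimal sketch of this step: fit the classifier on the training data, predict the test set, and map the numeric predictions back to species names through data.target_names. The new_flower measurements are a hypothetical input chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=50, test_size=0.25
)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict class indices (0, 1, 2) for the test set and convert them to names.
y_pred = clf.predict(X_test)
predicted_names = data.target_names[y_pred]
print(predicted_names[:5])

# Predict a single new flower: sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
print(data.target_names[clf.predict(new_flower)])
```

The predict method always expects a 2D input, which is why the single flower is wrapped in an extra pair of brackets.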
Complete code
Let’s take a look at the code.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Loading the Iris dataset bundled with scikit-learn
data = load_iris()

# Extracting Attributes / Features
X = data.data

# Extracting Target / Class Labels
y = data.target

# Import Library for splitting data
from sklearn.model_selection import train_test_split

# Creating Train and Test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.25)

# Creating Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Predict Accuracy Score
y_pred = clf.predict(X_test)
print("Train data accuracy:", accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print("Test data accuracy:", accuracy_score(y_true=y_test, y_pred=y_pred))
Explanation
- Lines 1-4: We import the necessary libraries to read and analyze the dataset.
- Line 7: We store the IRIS dataset in the variable data. Since the sklearn library contains the IRIS dataset by default, you do not need to upload it again.
- Line 10: We extract all of the attributes into the variable X.
- Line 13: We extract the target, i.e., the labels, into the variable y.
- Line 16: We import the train_test_split function.
- Line 19: We call the train_test_split() function. The parameter random_state can be set to any value, but the same value must be used every time in order to produce reproducible splits. The parameter test_size can also be adjusted as needed. Here, we use a test_size of 0.25, which means that 25% of the dataset is held out as test data and the remaining 75% is used as training data.
- Lines 22-24: We create a decision tree classifier and fit it to the training dataset. By default, the criterion parameter is set to gini.
- Lines 27-29: We predict the classes of the test set and use the accuracy_score function (imported on line 4) to compute the accuracy on both the training and test data.
- Lines 28-29: We get an output of 1, i.e., 100%, for the training data and 0.947, which is approximately 95%, for the test dataset.
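To see which splits the fitted tree actually learned, one option is scikit-learn's export_text helper, which prints the tree as indented rules from the root down to the leaves. The sketch below repeats the training steps so it can run on its own; fixing random_state=0 on the classifier is an assumption added here for repeatable output.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=50, test_size=0.25
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Print the learned rules; each "class:" line is a leaf node.
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
print("Depth:", clf.get_depth(), "Leaves:", clf.get_n_leaves())
```

A perfect training score combined with a lower test score often means the tree has grown deep enough to memorize the training set, and the printed depth and leaf count give a rough sense of that.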