How to build a decision tree with the IRIS dataset in Python

A decision tree is a supervised machine learning model that uses a tree-like structure of decisions and their consequences to arrive at a prediction. The data is split repeatedly according to the values of its features until a final decision is reached.

A decision tree is usually drawn upside down, with the root node at the top and the leaf nodes at the bottom. It usually contains three types of nodes:

Root node: The very top node that represents the entire population or sample.
Decision nodes: Internal nodes below the root that split further into sub-nodes.
Leaf nodes: Nodes with no children, also known as terminal nodes.
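
To make the three node types concrete, the sketch below fits a small tree and counts its nodes. It assumes scikit-learn is installed; max_depth=2 and random_state=0 are illustrative choices, not values from this article, and tree_ is scikit-learn's low-level view of the fitted tree.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a deliberately shallow tree so the structure stays small
# (max_depth=2 and random_state=0 are illustrative assumptions).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

tree = clf.tree_
# In scikit-learn's internal arrays, a leaf is a node whose left child is -1.
is_leaf = tree.children_left == -1

n_nodes = tree.node_count
n_leaves = int(is_leaf.sum())
n_internal = n_nodes - n_leaves  # the root node plus the decision nodes

print("Total nodes:", n_nodes)
print("Leaf nodes:", n_leaves)
print("Root + decision nodes:", n_internal)
```

Node 0 is always the root. Because every split produces exactly two children, the number of leaf nodes is always one more than the number of splitting nodes.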

Description

In supervised machine learning, there are two main types of problems:

Regression
Classification

You can use decision trees in Regression and Classification problems.

Regression tree: Used to predict continuous variables. For example, predicting rainfall in a region or the revenue that a company might generate in the future.
Classification tree: Used to predict discrete (categorical) variables. For example, classifying whether the temperature on a given day will be high or low, or predicting whether a team will win a match.
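
As a rough illustration of the two tree types, the sketch below fits scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor on the same feature. The data is made up for this example and does not come from the article.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy, made-up data with a single feature (hypothetical values).
X = np.array([[1], [2], [3], [10], [11], [12]])

# Classification tree: the target is a discrete label (0 = "low", 1 = "high").
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)

# Regression tree: the target is a continuous value.
y_reg = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)

queries = [[2.2], [11.2]]
pred_class = clf.predict(queries)  # class labels
pred_reg = reg.predict(queries)    # real-valued estimates

print("Predicted classes:", pred_class)
print("Predicted values:", pred_reg)
```

The classifier returns one of the labels it saw during training, while the regressor returns a number derived from the training targets that fall in the same leaf.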

How decision trees work

Decision trees work in a step-wise manner: at each node, the data is split on one feature, and the same process is then repeated on each resulting subset instead of fitting the whole dataset in a single pass. The feature and threshold used for each split are chosen according to a defined criterion. The main criteria are:

Gini impurity: Measures how mixed the class labels in a node are. It is 0 when every sample in the node belongs to the same class.
Entropy: Measures the randomness (disorder) of the class labels in a node.
Variance: Normally used in regression trees. It measures how far the target values in a node spread from their mean.
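
The three criteria above can be sketched in a few lines of NumPy. This is a minimal illustration using the standard textbook definitions (Gini = 1 - sum of squared class proportions, entropy = sum of p * log2(1/p)), not scikit-learn's internal implementation. The function names are hypothetical helpers introduced for this example.

```python
import numpy as np

def class_proportions(labels):
    """Return the fraction of samples belonging to each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    p = class_proportions(labels)
    return np.sum(p * np.log2(1.0 / p))

def variance(values):
    """Variance of the target values in a node, as used for regression trees."""
    return np.var(values)

pure_node = [0, 0, 0, 0]   # all samples share one class
mixed_node = [0, 0, 1, 1]  # two classes, evenly mixed

print(gini_impurity(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini_impurity(mixed_node), entropy(mixed_node))  # 0.5 1.0
print(variance([1.0, 2.0, 3.0, 4.0]))                  # 1.25
```

A split is considered good when the child nodes it produces have lower impurity (or lower variance, for regression) than the parent node.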

Complete code

# Import the dataset loader, the splitter, the classifier, and the accuracy metric
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset bundled with scikit-learn
data = load_iris()
# Extract the attributes / features
X = data.data
# Extract the target / class labels
y = data.target
# Create the train and test datasets (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, test_size=0.25)
# Create the decision tree classifier and fit it to the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Predict on the test data and print the accuracy scores
y_pred = clf.predict(X_test)
print("Train data accuracy:", accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print("Test data accuracy:", accuracy_score(y_true=y_test, y_pred=y_pred))

Explanation

Imports: We import the libraries needed to load the dataset, split it, train the model, and evaluate it.
data = load_iris(): We store the Iris dataset in the variable data. Since scikit-learn ships with the Iris dataset, you do not need to download it separately.
X = data.data: We extract all of the attributes into the variable X.
y = data.target: We extract the target, i.e., the class labels, into the variable y.
train_test_split(): We split the data into training and test sets. The parameter random_state can be set to any integer, but the same value must be reused to get a reproducible split. The parameter test_size can be adjusted as needed. Here, we use a test_size of 0.25, which assigns 25% of the dataset to the test set and the remaining 75% to the training set.
DecisionTreeClassifier() and clf.fit(): We create a decision tree classifier and fit it to the training dataset. By default, the criterion parameter is set to gini.
clf.predict() and accuracy_score(): We predict the labels of the test data and use the accuracy_score function to compute the accuracy on both the training and test data.
Output: We get 1.0, i.e., 100%, for the training data and 0.947, which is approximately 95%, for the test data.
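
The effect of test_size and random_state described above can be checked directly. The sketch below reuses the same values as the article (random_state=50, test_size=0.25) and assumes scikit-learn is installed. The variable names split_a, split_b, and identical are introduced only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# The Iris dataset has 150 samples with 4 features each.
X, y = load_iris(return_X_y=True)

# Split twice with the same random_state to check reproducibility.
split_a = train_test_split(X, y, random_state=50, test_size=0.25)
split_b = train_test_split(X, y, random_state=50, test_size=0.25)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)

# Every returned array is identical across the two calls.
identical = all((a == b).all() for a, b in zip(split_a, split_b))
print(identical)  # True
```

With 38 test samples, the reported test accuracy of 0.947 corresponds to 36 correct predictions out of 38.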
