How to save a machine learning model using Python's pickle module
Machine learning is a subset of artificial intelligence in which computers are trained on large data sets to make predictions or decisions. A typical workflow involves collecting data, pre-processing it so that it is suitable for training, training a model, evaluating and improving it, and using it to make predictions.
Once the model has been trained on a data set, we can save it using Python's pickle module, which implements binary protocols to serialize and deserialize objects into and from byte streams. We use the term "pickling" when an object is converted into a byte stream, and "unpickling" when a byte stream is converted back into an object.
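For instance, here is a minimal sketch of pickling and unpickling a plain Python dictionary (the dictionary is only an illustrative stand-in for any Python object):
import pickle

# Pickling: convert a Python object into a byte stream
sample = {"model_name": "example", "accuracy": 0.9}
byte_stream = pickle.dumps(sample)

# Unpickling: convert the byte stream back into a Python object
restored = pickle.loads(byte_stream)
print(restored)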
In this Answer, we will train a simple machine learning model, save it, and load it again so that we can make predictions with it in the future.
Technologies used
We will be using the following technologies:
Pandas: We will use the pandas library to load the data set into a DataFrame.
Sklearn: We will use the RandomForestClassifier model from sklearn (Python's machine learning library) and train it on our data set.
Pickle: We will use it to save our model and load it again in our program code.
Saving the model using the pickle module
In this section, we will be going through a step-by-step process in which we will:
Load a data set.
Split the data set into x (features) and y (output).
Perform a train-test split (training data = 80%, test data = 20%).
Import the RandomForestClassifier model and train it on the training data.
Save the model as a binary file with the .pkl file extension.
Load the saved model and perform predictions.
Loading the dataset
To apply a machine learning model, we first need a data set. In this Answer, we use the heart-disease-dataset.csv file, which contains information related to heart disease. The code to read the CSV file is given below:
import pandas as pd
heart_disease_df = pd.read_csv("heart-disease-dataset.csv")
print(heart_disease_df.head())
Code explanation
Line 1: We import the pandas library.
Line 3: Using the pandas library, we read the CSV file using the read_csv function. The function reads the CSV file and converts it into a pandas DataFrame.
Line 5: We print the first five rows of the DataFrame using the head function.
Train-test split
Now that we have loaded the data set into our program, we split it into features and output. The output is the "target" column, which indicates whether a person has heart disease or not. After that, we further split the data into training and testing sets. The code for the splitting is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
heart_disease_df = pd.read_csv("heart-disease-dataset.csv")
print(heart_disease_df.head())
x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']
np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
print("x shape:", x.shape)
print("y shape:", y.shape)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)Code explanation
Line 2: We import the NumPy library to set the random seed to 0 on line 12.
Line 3: We import the train_test_split function from the sklearn.model_selection package.
Line 9: We remove the "target" column from the loaded data set using the drop function and store the result as x (features).
Line 10: We extract the "target" column from the data set and save it as y (output).
Line 13: We pass x and y to the train_test_split function, which splits them into train and test data depending on the test_size.
Lines 15–20: We print the shapes of the data for confirmation.
Applying the model
So far, we have split our data into training and testing sets. We will now pass the training data to our RandomForestClassifier model and calculate the model's accuracy on the test data. The code is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
heart_disease_df = pd.read_csv("heart-disease-dataset.csv")
x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']
np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
model = RandomForestClassifier()
model.fit(x_train, y_train)
model_accuracy = model.score(x_test, y_test)
print("Model Accuracy:" , model_accuracy * 100 , "%")Code explanation
Line 4: We import the RandomForestClassifier model from the sklearn.ensemble library.
Line 14: We create an object of the RandomForestClassifier model.
Line 16: We train our model on the training data using the fit method.
Line 18: We evaluate the model by passing the test data to the score method.
Line 20: We display the accuracy on the screen.
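Besides the overall accuracy returned by score, we can inspect individual predictions with the predict method. Here is a minimal sketch, assuming the model, x_test, and y_test variables from the code above are already defined:
# Predict the target for the first five test samples and compare with the actual labels
predictions = model.predict(x_test[:5])
print("Predicted:", predictions)
print("Actual:   ", y_test[:5].values)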
Congratulations! We have created a classification model using sklearn.
Saving the model
Now that we have trained our model, we will save it using Python's pickle module. The code for it is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle
heart_disease_df = pd.read_csv("heart-disease-dataset.csv")
x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']
np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
model = RandomForestClassifier()
model.fit(x_train, y_train)
model_accuracy = model.score(x_test, y_test)
print("Model Accuracy:" , model_accuracy * 100 , "%")
pickle.dump(model , open('heart-disease-model.pkl' , 'wb'))
Code explanation
Line 5: We import the pickle module.
Line 23: We save the model using the dump function provided by the pickle module. The function takes two parameters:
1st parameter: The object/model that is to be saved.
2nd parameter: The file object to which the model is written. We use the open function, which takes the file name (heart-disease-model.pkl) and the mode for opening the file (wb).
Note: Use the ls command in the terminal to view the saved file.
Running the above code will save the model in a binary file that can be shared and used by loading it, as we will do in the next section.
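Passing the result of open directly to pickle.dump works, but it leaves the file handle to be closed implicitly. As an alternative sketch, a with statement closes the file automatically once the model has been written:
import pickle

# Save the trained model; the file is closed automatically when the block ends
with open('heart-disease-model.pkl', 'wb') as file:
    pickle.dump(model, file)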
Loading the model
The code to load the saved model is given below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pickle
heart_disease_df = pd.read_csv("heart-disease-dataset.csv")
x = heart_disease_df.drop("target" , axis = 1)
y = heart_disease_df['target']
np.random.seed(0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
model = RandomForestClassifier()
model.fit(x_train, y_train)
model_accuracy = model.score(x_test, y_test)
print("Model Accuracy:" , model_accuracy * 100 , "%")
pickle.dump(model , open('heart-disease-model.pkl' , 'wb'))
loaded_model = pickle.load(open('heart-disease-model.pkl' , 'rb'))
loaded_model_accuracy = loaded_model.score(x_test, y_test)
print("Loaded Model Accuracy:" , loaded_model_accuracy * 100 , "%")Code explanation
Line 25: We use the load function from the pickle module to load our saved machine learning model. We use the open function, which takes the file name (heart-disease-model.pkl) that contains the saved model and the mode for opening the file (rb).
Line 27: To confirm that the loaded model works, we pass the test data to its score method, which performs predictions and returns the accuracy score.
Line 29: We display the accuracy of the loaded model.
We have successfully loaded our saved model.
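The loaded model can also be used to predict on new, unseen data. As a small sketch, the snippet below reuses a single row of x_test to stand in for a new patient record (assuming the variables from the code above are in scope):
# Take one row from the test set as a stand-in for new data;
# double brackets keep it a DataFrame with the expected columns
new_sample = x_test.iloc[[0]]
prediction = loaded_model.predict(new_sample)
print("Predicted target:", prediction[0])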
Conclusion
The pickle module makes it straightforward to save our model and load it again. Saving trained models as binary files allows them to be shared between teams and systems without having to train the model again.