Difference between predict and predict_proba in Sklearn
Sklearn (scikit-learn) is an open-source Python machine-learning library that provides the essential tools to perform machine-learning tasks. In classification tasks, two methods are commonly used to make predictions: predict() and predict_proba(). Although both methods serve the same goal of predicting a category, they differ in the output they return, which this Answer covers.
predict() vs predict_proba()
The predict() method is used to predict a category for a set of input features. It returns one discrete class label for each input sample. For example, consider a model that predicts whether a picture shows a cat or a dog. We pass an image to the model and call predict(). The model returns a single discrete output (cat), which can be associated with the input image.
On the other hand, the predict_proba() method returns the predicted probabilities of each input sample belonging to each category. Instead of returning a discrete class, it returns the probability associated with every class. This is useful when we want to know not only the predicted category but also the model's confidence in its prediction. For the same cat-and-dog example, predict_proba() returns the probability of each category (cat and dog), e.g., 0.95 and 0.05, respectively.
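As a rough sketch of the cat-and-dog example above, the snippet below fits a classifier on a tiny made-up dataset and calls both methods on one new sample. The feature values, the new_animal sample, and the choice of LogisticRegression are assumptions made purely for illustration; a real image model would work on very different inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two made-up numeric features per animal.
X = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
              [3.0, 3.2], [3.1, 2.9], [2.8, 3.1]])
y = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

model = LogisticRegression().fit(X, y)

# One new sample that sits close to the "cat" examples.
new_animal = np.array([[1.0, 1.1]])

label = model.predict(new_animal)        # one discrete class label
probs = model.predict_proba(new_animal)  # shape (1, 2): one column per class

print(model.classes_)  # the column order used by predict_proba
print(label)
print(probs)
```

The columns of the predict_proba() output follow the order of model.classes_, so that attribute is what tells us which probability belongs to which category.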
Key difference
Both methods predict categories, but they return their output differently. predict() returns the single discrete category that the model has predicted, whereas predict_proba() returns continuous values that represent the likelihood of each input belonging to each class. With predict_proba(), we need to extract the class with the highest probability ourselves if we want to assign a label to an input instance.
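The manual extraction step can be sketched as follows, assuming a LogisticRegression model fitted on the iris data. The max_iter=1000 setting and the variable name labels_from_proba are choices made for this sketch, not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# max_iter is raised here only to give the solver room to converge.
model = LogisticRegression(max_iter=1000).fit(X, y)

probabilities = model.predict_proba(X)

# Take the index of the largest probability in each row,
# then map that index back to a class label through classes_.
labels_from_proba = model.classes_[np.argmax(probabilities, axis=1)]

# Compare the manually extracted labels with the output of predict().
print(np.array_equal(labels_from_proba, model.predict(X)))
```

Going through model.classes_ rather than using the argmax index directly keeps the sketch correct even when the class labels are strings or do not start at 0.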
Coding example
To show how the two methods differ, we will train a LogisticRegression model that predicts the iris name.
Dataset
The dataset we use in the example is the iris dataset that ships with sklearn.
It has 3 different types of iris names, i.e., Setosa, Versicolor, and Virginica.
The feature columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.
The dataset has 150 instances with 5 columns (4 feature columns plus the target column).
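These properties can be checked directly on the object returned by load_iris(). The snippet below is a small sketch that only prints shapes and names. Note that sklearn stores the 4 feature columns and the target column in separate arrays.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

print(iris_data.data.shape)          # (number of instances, number of feature columns)
print(iris_data.target.shape)        # one target value per instance
print(iris_data.feature_names)       # names of the feature columns
print(list(iris_data.target_names))  # names of the iris categories
```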
Implementation
The code implementation shows how to train a LogisticRegression model on the iris data set. The model learns the patterns from the dataset and then can predict the iris name depending on the input features. We will use predict() and predict_proba() to make predictions in this example. The code can be seen below:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
y = iris_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict using 'predict'
predicted_labels = model.predict(X_test)

# Predict using 'predict_proba'
predicted_probabilities = model.predict_proba(X_test)

print("Predicted Probabilities:")
for probs in predicted_probabilities:
    formatted_probs = [f"{prob:.2f}" for prob in probs]
    print(formatted_probs)
Code explanation
Lines 1–3: We import the train_test_split function, the LogisticRegression model, and the load_iris dataset loader.
Line 6: We load the dataset and store it in the iris_data variable. The following are the main attributes of the returned object that we will be dealing with:
data: An array that contains the feature data of the complete dataset.
target: An array that contains the target value corresponding to each row of the data array.
target_names: An array containing the iris names (categories).
Line 7: We create the features variable X and assign it the feature data from the data array.
Line 8: We create the target/output variable y and fill it with the target data.
Line 11: We split the data into training and testing sets using the train_test_split function. We set the test size to 20% of the whole data. The random_state parameter fixes the randomness of the split so that we get the same split every time we run the code.
Lines 14–15: We create an instance of the LogisticRegression model and train it on the training data using the fit() method.
Now that our model is trained, we can make predictions using predict() and predict_proba():
Line 18: We use the predict() method and pass the test data on which we want to perform the predictions. The method returns discrete values, either 0, 1, or 2 (setosa, versicolor, or virginica, respectively).
Line 21: We use the predict_proba() method, which returns the probabilities of each class for every test sample ([probability_1, probability_2, probability_3]).
The probabilities for each sample sum to 1, i.e., probability_1 + probability_2 + probability_3 = 1.
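The two outputs can also be inspected side by side. The sketch below repeats the same split as the example, checks that every row of probabilities sums to 1, and prints the predicted iris name next to the probability of the predicted class. The max_iter=1000 setting and the names row_sums and confidence are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42
)

# max_iter is raised here only to give the solver room to converge.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

predicted_labels = model.predict(X_test)
predicted_probabilities = model.predict_proba(X_test)

# Each row holds one probability per class, and each row sums to 1.
row_sums = predicted_probabilities.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# Pair each predicted iris name with the probability of the predicted class.
confidence = predicted_probabilities.max(axis=1)
for label, conf in zip(predicted_labels[:5], confidence[:5]):
    print(f"{iris_data.target_names[label]}: {conf:.2f}")
```

Indexing target_names with the integer labels works here because the iris targets are 0, 1, and 2, which line up with the positions in that array.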
Conclusion
In conclusion, the predict method returns the predicted class labels for input instances, while the predict_proba method returns the predicted probabilities of each class for the input instances. The former is useful for obtaining discrete class predictions, while the latter is valuable when we need to understand the model's confidence or uncertainty in its predictions.