Scikit-learn (sklearn) is an open-source Python machine-learning library that provides essential tools for machine-learning tasks. In classification tasks, two of its commonly used methods are predict() and predict_proba(). Although both methods serve the same goal of predicting a category, they differ in the output they return, which this Answer covers.
predict() vs. predict_proba()
The predict() method is used to predict a category for a set of input features. It returns a discrete class label, one for each input sample.
In the illustration below, we can see a model that predicts whether a picture shows a cat or a dog. We pass an image to the model, and the model uses the predict() method. The model returns a discrete label (cat), which is associated with the input image.
On the other hand, the predict_proba() method returns the predicted probability of each input sample belonging to each category. Instead of returning a single discrete class, it returns the probabilities associated with every class. This is useful when we want to know not only the predicted category but also the model's confidence in that prediction.
For the same cat-and-dog example, predict_proba() returns the probability of each category (cat and dog), for example 0.95 and 0.05, respectively.
Both methods predict categories, but they return the output differently. predict() returns a single discrete category for each input, whereas predict_proba() returns continuous values that represent the likelihood of each input belonging to each class. With predict_proba(), we need to extract the class with the highest probability ourselves and assign it to the input instance.
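A minimal sketch of that manual step, assuming a LogisticRegression model fitted on the iris dataset that ships with sklearn. The variable names and max_iter=1000 are illustrative choices, not part of the original example. It takes the position of the highest probability in each row and maps it back to a class label through model.classes_, which should give the same labels as predict().

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Load features and integer class labels
X, y = load_iris(return_X_y=True)

# max_iter=1000 is an assumption to give the solver room to converge
model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample, one column per class
probabilities = model.predict_proba(X)

# Column i of predict_proba corresponds to model.classes_[i],
# so the argmax of each row is the index of the most likely class
labels_from_proba = model.classes_[np.argmax(probabilities, axis=1)]
labels_from_predict = model.predict(X)

# Check whether both routes give the same labels
print(np.array_equal(labels_from_proba, labels_from_predict))
```

Going through model.classes_ rather than using the argmax index directly keeps the mapping correct even when the class labels are not 0, 1, 2 (for example, string labels).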
To show how the two methods differ, we will use a LogisticRegression model on a classification problem in which we want to predict the iris species.
The dataset we use in the example is the iris dataset that ships with sklearn.
It has 3 iris species: Setosa, Versicolor, and Virginica.
The feature columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.
The dataset has 150 instances with 5 columns (4 feature columns plus the target column).
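The properties listed above can be checked with a short sketch like the following, which only inspects the object returned by load_iris(). Note that sklearn stores the 4 feature columns and the target column in separate arrays.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# Feature matrix: (number of samples, number of feature columns)
print(iris_data.data.shape)

# Target array: one integer label per sample
print(iris_data.target.shape)

# Names of the feature columns and of the species
print(iris_data.feature_names)
print(iris_data.target_names)
```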
The code below shows how to train a LogisticRegression model on the iris dataset. The model learns patterns from the training data and can then predict the iris species from the input features. We will use predict() and predict_proba() to make predictions in this example:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
y = iris_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict using 'predict'
predicted_labels = model.predict(X_test)

# Predict using 'predict_proba'
predicted_probabilities = model.predict_proba(X_test)

print("Predicted Probabilities:")
for probs in predicted_probabilities:
    formatted_probs = [f"{prob:.2f}" for prob in probs]
    print(formatted_probs)
Lines 1–3: We import the train_test_split function, the LogisticRegression model, and the load_iris dataset loader.
Line 6: We load the dataset and store the returned object in the iris_data variable. The following are the main attributes of this object that we will be dealing with:
data
: An array that contains the data of the complete data set.
target
: An array that contains the target data corresponding to each row of the data array.
target_names
: An array containing the iris species names (the categories).
Line 7: We create the feature matrix X and assign it the feature data from the data array.
Line 8: We create the target vector y and assign it the target data.
Line 11: We split the data into training and testing sets using the train_test_split function. We set the test size to 20% of the whole dataset. The random_state parameter fixes the seed of the random shuffling, so the data is split the same way every time we run the code.
Lines 14–15: We create an instance of the LogisticRegression
model and train it on the training data using the fit()
method.
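The effect of random_state described for the split can be checked with a small sketch like the one below. It repeats the split with the same seed and compares the results; the names first, second, and same_split are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two splits with the same seed
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, and y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)

# Sizes of the training and testing feature arrays
print(first[0].shape, first[1].shape)
```

With 150 samples and test_size=0.2, the split yields 120 training rows and 30 testing rows.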
Now that our model is trained, we can make predictions using predict() and predict_proba(). The next two explanations cover them:
Line 18: We call the predict() method and pass it the test data on which we want to make predictions. The method returns discrete values, either 0, 1, or 2 (setosa, versicolor, or virginica, respectively).
Line 21: We call the predict_proba() method, which returns the probability of each class for every test sample ([probability_1, probability_2, probability_3]).
The probabilities in each row sum to 1, i.e., probability_1 + probability_2 + probability_3 = 1.
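As a hedged sketch of how both outputs could be inspected, the snippet below repeats the training steps, maps the integer labels from predict() to species names through target_names, and checks that each row from predict_proba() sums to 1. The names predicted_names and row_sums, and the max_iter=1000 setting, are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.2, random_state=42
)

# max_iter=1000 is an assumption to give the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predicted_labels = model.predict(X_test)
predicted_probabilities = model.predict_proba(X_test)

# Translate the integer labels (0, 1, 2) into species names
predicted_names = iris_data.target_names[predicted_labels]
print(predicted_names[:5])

# One row per test sample, one column per class
print(predicted_probabilities.shape)

# Each row should add up to 1 (within floating-point tolerance)
row_sums = predicted_probabilities.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```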
In conclusion, the predict
method returns the predicted class labels for input instances, while the predict_proba
method returns the predicted probabilities of each class for the input instances. The former is useful for obtaining discrete class predictions, while the latter is valuable when we need to understand the model's confidence or uncertainty in its predictions.