**Sklearn** (scikit-learn) is an open-source Python machine-learning library that provides the essential tools for machine-learning tasks. In classification tasks, two methods are commonly used for **predictive modeling**: `predict()` and `predict_proba()`.

`predict()` vs `predict_proba()`

The `predict()` method predicts a category for a set of input features. It returns a discrete value (a class label) for each input instance. For example, consider a model that predicts whether a picture shows a cat or a dog. We pass an image to the model, and the model uses the `predict()` method. The model returns a discrete output (cat), which is associated with the input image (the features).

On the other hand, the `predict_proba()` method returns the predicted probability of the input belonging to each category. Instead of returning a discrete class, it returns the probabilities associated with every class. This is useful when we want to know not only the predicted category but also the model's confidence in its prediction. For the same cat and dog example, `predict_proba()` returns the probabilities of each category (cat and dog), i.e., 0.95 and 0.05, respectively.

Both methods predict categories, but they return their output differently. `predict()` returns a single discrete category for each input, whereas `predict_proba()` returns continuous values that represent the likelihood of each input belonging to each class. With `predict_proba()`, we need to extract the class with the highest probability ourselves if we want to assign a label to an input instance.
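The manual extraction step can be sketched in plain Python. This is only an illustration: the `class_names` list, the probability values (taken from the cat/dog example above), and the helper `most_likely_class` are hypothetical names introduced here, not part of sklearn.

```python
# Hypothetical class names and probabilities, mirroring the cat/dog example.
class_names = ["cat", "dog"]
probabilities = [0.95, 0.05]  # one row of a predict_proba()-style output


def most_likely_class(probs, names):
    """Return the name with the highest probability, plus that probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return names[best_index], probs[best_index]


label, confidence = most_likely_class(probabilities, class_names)
print(label, confidence)  # cat 0.95
```

Keeping the probability next to the label is what lets us reason about how confident the model was, which a bare class label cannot tell us.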

To show how the two methods differ, we will use a `LogisticRegression` model on the following problem, in which we want to predict the iris name (species).

The dataset used in the example comes from sklearn's built-in datasets.

It contains 3 types of iris, i.e., Setosa, Versicolour, and Virginica.

The feature columns are Sepal Length, Sepal Width, Petal Length, and Petal Width.

The dataset has 150 instances with 5 columns (4 features plus the target column).
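As a quick check of the description above, the following sketch loads the dataset with `load_iris()` and inspects its shape and names. Note that sklearn stores the features and the target in separate arrays, so the 5 columns appear as a 4-column `data` array plus a `target` array.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(iris_data.data.shape)             # (150, 4)
# Target vector: one integer class label per flower.
print(iris_data.target.shape)           # (150,)
# Names of the four feature columns.
print(iris_data.feature_names)
# Names of the three classes, in the order of the labels 0, 1, 2.
print(iris_data.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']
```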

The code below shows how to train a `LogisticRegression` model on the iris dataset. The model learns patterns from the training data and can then predict the iris name from the input features. We will use `predict()` and `predict_proba()` to make predictions in this example:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the Iris dataset
iris_data = load_iris()
X = iris_data.data
y = iris_data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict using 'predict'
predicted_labels = model.predict(X_test)

# Predict using 'predict_proba'
predicted_probabilities = model.predict_proba(X_test)

print("Predicted Probabilities:")
for probs in predicted_probabilities:
    formatted_probs = [f"{prob:.2f}" for prob in probs]
    print(formatted_probs)
```

Executing predict and predict_proba

**Lines 1–3:** We import the `train_test_split` function, the `LogisticRegression` model, and the `load_iris` dataset loader.

**Line 6:** We load the dataset and store it in the `iris_data` variable. The following are the main attributes of this object that we will be dealing with:

`data`: An array that contains the feature values of the complete dataset.

`target`: An array that contains the target value corresponding to each row of the `data` array.

`target_names`: An array containing the iris names (the categories).

**Line 7:** We create the feature matrix `X` and assign it the feature data from the `data` array.

**Line 8:** We create the target vector `y` and assign it the target data.

**Line 11:** We split the data into training and testing sets using the `train_test_split` function. We set the test size to 20% of the whole dataset. The `random_state` parameter fixes the seed used for shuffling, so the data is split the same way every time the code runs.

**Lines 14–15:** We create an instance of the `LogisticRegression` model and train it on the training data using the `fit()` method.
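The effect of `test_size` and `random_state` can be checked with a small sketch. It calls `train_test_split` twice with the same arguments and compares the results; the variable names with `_a` and `_b` suffixes are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two independent calls with the same random_state.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 20% of 150 rows go to the test set, the remaining 80% to training.
print(X_train_a.shape, X_test_a.shape)  # (120, 4) (30, 4)

# The same seed yields the same rows in the same order.
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)  # True
```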

Now that our model is trained, we can make predictions using `predict()` and `predict_proba()`. The next two line explanations cover them:

**Line 18:** We call the `predict()` method and pass it the test data on which we want to make predictions. The method returns discrete values, either 0, 1, or 2 (setosa, versicolor, or virginica, respectively).

**Line 21:** We call the `predict_proba()` method, which returns the probabilities of each class for every test instance ([probability_1, probability_2, probability_3]).

The probabilities for each instance sum to 1, i.e., probability_1 + probability_2 + probability_3 = 1.
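The relationship between the two outputs can be verified with a short sketch. It retrains the same kind of model, checks that each row of probabilities sums to 1, and recovers the labels by taking the most probable class per row. The `max_iter=1000` argument is an assumption of this sketch (it is not in the example above) and is there only to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_iter=1000 is an assumption of this sketch, not part of the original example.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predicted_labels = model.predict(X_test)
predicted_probabilities = model.predict_proba(X_test)

# One row per test instance, one column per class.
print(predicted_probabilities.shape)  # (30, 3)

# Each row of probabilities sums to 1 (up to floating-point error).
rows_sum_to_one = np.allclose(predicted_probabilities.sum(axis=1), 1.0)

# Column i of predict_proba corresponds to model.classes_[i], so the most
# probable column per row maps back to a class label.
labels_from_proba = model.classes_[np.argmax(predicted_probabilities, axis=1)]
labels_match = np.array_equal(labels_from_proba, predicted_labels)
```

Mapping through `model.classes_` rather than using the column index directly keeps the sketch correct even when class labels are not the integers 0, 1, 2.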

In conclusion, the `predict()` method returns the predicted class labels for input instances, while the `predict_proba()` method returns the predicted probabilities of each class for the input instances. The former is useful for obtaining discrete class predictions, while the latter is valuable when we need to understand the model's confidence or uncertainty in its predictions.

Copyright ©2024 Educative, Inc. All rights reserved
