Understanding predict_proba() from MultiOutputClassifier
The predict_proba method is commonly used in machine learning to obtain probability estimates for the different possible outcomes or classes of a classification problem. When working with a multioutput classification problem, such as using MultiOutputClassifier in scikit-learn, predict_proba can obtain probability estimates for each output variable or dimension.
Here are some key methods and concepts associated with predict_proba to help us understand this method in greater detail, starting with the definition of multioutput classification.
What is multioutput classification?
Multioutput classification is a type of supervised learning in which each sample has multiple target variables, each with its own set of possible classes or labels. Multilabel classification is the special case in which every target is binary, so a particular input can carry several labels at once.
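To make the distinction concrete, here is a minimal sketch with two small, made-up target arrays (the values are purely illustrative). It uses scikit-learn's type_of_target helper to show how the library itself labels each kind of target:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets for 5 samples and 2 output variables.
# Each column is a separate target with classes {0, 1, 2}.
y_multioutput = np.array([
    [0, 2],
    [1, 1],
    [2, 0],
    [0, 0],
    [1, 2],
])

# Multilabel special case: every column is a binary (0/1) indicator.
y_multilabel = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
])

print(y_multioutput.shape)            # (n_samples, n_outputs)
print(type_of_target(y_multioutput))  # how scikit-learn classifies this target
print(type_of_target(y_multilabel))
```

Both arrays have one row per sample and one column per output. Only the set of values allowed in each column differs.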
The MultiOutputClassifier wrapper
MultiOutputClassifier is a wrapper in scikit-learn that extends a base classifier to multioutput problems by fitting one copy of that classifier per output variable. It treats each output variable as an independent classification problem (binary or multiclass), so relationships between the outputs are not modeled.
The syntax for this wrapper is given below:
multi_output_classifier = MultiOutputClassifier(base_classifier)
Note: base_classifier can be any scikit-learn classifier, such as RandomForestClassifier. To use predict_proba on the wrapper, the base classifier must itself implement predict_proba.
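The "one classifier per output" behavior can be checked through the estimators_ attribute of a fitted wrapper. The sketch below uses a small random dataset. The seed, sample count, and n_estimators value are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)        # fixed seed, chosen only for repeatability
X = rng.random((30, 4))               # 30 samples, 4 features
y = rng.integers(0, 3, size=(30, 2))  # 2 outputs, labels drawn from {0, 1, 2}

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X, y)

# One fitted copy of the base classifier exists per output column.
print(len(clf.estimators_))
for i, est in enumerate(clf.estimators_):
    print(i, type(est).__name__, est.classes_)
```

Each entry in estimators_ is a separate fitted RandomForestClassifier, trained only on its own column of y.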
Now, let's discuss the steps to implement this method.
Step 1: Create a sample dataset
First, we create a sample dataset of 100 samples with four features and two output variables. Each output takes one of three possible class labels:
import numpy as np

# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
A diagram visualizing the setup of the model is given below for a better understanding of this example:
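After this, we split the dataset into training and test datasets in an 80-20 ratio using the train_test_split function:
from sklearn.model_selection import train_test_split

# Stack y1 and y2 into a target array of shape (num_samples, 2) and split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)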
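As a quick sanity check, the shapes of the four arrays can be printed after the split. This sketch rebuilds the same kind of random dataset. The random_state argument is an addition for repeatability and is not part of the original example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)
y = np.column_stack((
    np.random.randint(0, 3, size=100),  # first output variable
    np.random.randint(0, 3, size=100),  # second output variable
))

# random_state is an assumption added here so the split is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # feature arrays
print(y_train.shape, y_test.shape)  # target arrays keep both output columns
```

The target arrays stay two-dimensional after the split, with one column per output variable.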
Step 2: Create a multioutput classifier
Next, we initialize our multioutput classifier with a base classifier, which is RandomForestClassifier in this example.
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)
The base classifier is passed to MultiOutputClassifier, so the wrapper is now ready to be fitted to our sample dataset.
Step 3: Fit the sample dataset
Next, we fit multi_output_classifier to the training data and then call predict_proba on the test data to generate class probabilities for each output variable:
# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)
Note: The predict_proba method takes only one input parameter, which is the test data itself.
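Printing the raw result can be hard to read, so it may help to inspect its structure instead. The following sketch repeats the steps above with fixed seeds (an assumption made only for repeatability) and reports the type, length, and per-output shapes of the returned value:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)  # fixed seed is an assumption for repeatability
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)

# The result is a Python list holding one array per output variable.
print(type(probabilities).__name__, len(probabilities))
for i, (est, p) in enumerate(zip(clf.estimators_, probabilities)):
    rows_sum_to_one = np.allclose(p.sum(axis=1), 1.0)
    print(f"output {i}: shape={p.shape}, classes={est.classes_}, "
          f"rows sum to 1: {rows_sum_to_one}")
```

Each array has one row per test sample and one column per class seen during training for that output.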
Code example
The code using the predict_proba method is given below:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Define the number of samples
num_samples = 100
# Generate random features (X); here 4 random features are being generated
X = np.random.rand(num_samples, 4)
# Generate random target values (y) for two output variables, each with three classes
num_classes = 3
y1 = np.random.randint(0, num_classes, size=num_samples)
y2 = np.random.randint(0, num_classes, size=num_samples)
# Create a sample training and test dataset
X_train, X_test, y_train, y_test = train_test_split(X, np.column_stack((y1, y2)), test_size=0.2)

# Create a multioutput classifier using a base classifier (e.g., RandomForest)
base_classifier = RandomForestClassifier()
multi_output_classifier = MultiOutputClassifier(base_classifier)

# Fit the multioutput classifier to the training data
multi_output_classifier.fit(X_train, y_train)
# Use the predict_proba method to get probability estimates for each output variable
probabilities = multi_output_classifier.predict_proba(X_test)
print(probabilities)
Code explanation
The line-by-line code explanation is given below:
Lines 6–15: We start by creating a dataset with a two-dimensional target array, since the MultiOutputClassifier wrapper expects targets of shape (n_samples, n_outputs). For this example, we generate one hundred samples with four features each. We also generate random label values for two output variables, y1 and y2, each with three classes to choose from. Once the input data and output labels are generated, we split them into training and test datasets with an 80-20 split.
Lines 17–19: Next, we use a base classifier to create a multioutput classifier. Here, the base classifier is RandomForestClassifier.
Lines 21–25: Finally, we fit the multioutput classifier to the training data and call the predict_proba method on the test data to get the probability estimates for the two output variables. The probabilities are stored in the probabilities variable, which is then printed.
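Keep in mind that the structure of the returned probabilities depends on the estimator we call predict_proba on. A single-output classifier typically returns one array of shape (n_samples, n_classes), while MultiOutputClassifier returns a list with one such array per output variable. The columns of each array follow the order of the classes_ attribute of the corresponding fitted estimator.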
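One consequence is that the arrays for different outputs need not have the same number of columns. The sketch below builds a small dataset in which the first output has three classes and the second has two. The label values 10 and 20 are arbitrary and used only to show how columns map to classes_:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(1)  # fixed seed, chosen only for repeatability
X = rng.random((60, 4))

# Deterministic, made-up labels so every class is guaranteed to appear.
idx = np.arange(60)
y = np.column_stack((
    idx % 3,                         # first output: classes 0, 1, 2
    np.where(idx % 2 == 0, 10, 20),  # second output: classes 10, 20
))

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=1))
clf.fit(X, y)
probs = clf.predict_proba(X[:5])

# Column j of each array holds the probability of est.classes_[j].
for est, p in zip(clf.estimators_, probs):
    print(est.classes_, p.shape)
```

Because the column counts can differ, the per-output arrays are kept in a list rather than stacked into a single array.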
Output
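The printed probabilities value is a list with one array per output variable. In each array, every row corresponds to one test sample and holds three probabilities that sum to 1, one for each of the three classes. Because the data is random, the exact values differ from run to run. An excerpt of one such array looks like this: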
[[0.6, 0.4, 0. ],[0.3, 0.6, 0.1],[0. , 1. , 0. ],...]
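The probabilities can also be turned back into class labels, along with a simple confidence score per prediction. The sketch below takes the most probable class for each output and uses its probability as the confidence. The seeds and the variable names predictions and confidence are illustrative choices, not part of the original example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)  # fixed seed is an assumption for repeatability
X = rng.random((100, 4))
y = rng.integers(0, 3, size=(100, 2))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0))
clf.fit(X_train, y_train)
probabilities = clf.predict_proba(X_test)

pred_cols, conf_cols = [], []
for est, p in zip(clf.estimators_, probabilities):
    best = np.argmax(p, axis=1)         # column index of the most probable class
    pred_cols.append(est.classes_[best])  # map column index back to a class label
    conf_cols.append(p.max(axis=1))     # probability of that class

predictions = np.column_stack(pred_cols)  # shape (n_samples, n_outputs)
confidence = np.column_stack(conf_cols)   # same shape, values in [1/3, 1]

print(predictions[:3])
print(confidence[:3])
```

For this setup, the labels obtained this way should match what clf.predict(X_test) returns, while the confidence array shows how strongly the model favors each predicted class.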
Conclusion
Overall, predict_proba in the MultiOutputClassifier wrapper allows us to obtain probability estimates for each output variable in a multioutput classification problem. This helps to evaluate the model’s confidence in its predictions for each dimension or target variable.