Classification metrics quantify how well a classifier performs. Below are the classification metrics most commonly used by data scientists:
Confusion matrices visualize the counts of labels that are classified correctly and incorrectly. The confusion matrix highlights the performance of our model on each class and identifies the kinds of errors the model makes. The four quantities in the matrix are as follows:
- True Positive (TP): an observation of the positive class correctly predicted as positive.
- True Negative (TN): an observation of the negative class correctly predicted as negative.
- False Positive (FP): an observation of the negative class incorrectly predicted as positive.
- False Negative (FN): an observation of the positive class incorrectly predicted as negative.
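The code snippets in this answer use test_Y (the ground-truth labels of the test set) and pred_Y (the model's predictions on that set). If you want something self-contained to run them against, here is a minimal sketch; the synthetic dataset and logistic regression model are just stand-ins for your own data and classifier:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for a real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
train_X, test_X, train_Y, test_Y = train_test_split(X, y, test_size=0.25, random_state=42)

# Any classifier with predict/predict_proba works here
model = LogisticRegression().fit(train_X, train_Y)
pred_Y = model.predict(test_X)                # hard class labels
scores_Y = model.predict_proba(test_X)[:, 1]  # positive-class probabilities (used later)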
You can compute the confusion matrix with scikit-learn's metrics module and plot it with seaborn, as shown below:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Rows correspond to actual labels, columns to predicted labels
cm = confusion_matrix(test_Y, pred_Y)
sns.heatmap(cm, annot=True, fmt='d')
plt.show()
Accuracy is the most widely used classification metric to evaluate the performance of a classifier. Accuracy tells the proportion of observations that are classified correctly and is defined as follows:
Accuracy = (TP + TN) / N
where N = TP + TN + FP + FN is the total number of observations.
You can understand accuracy through the following question: Out of all the predictions, how many did we predict correctly?
You can calculate the accuracy score of your model through scikit-learn’s metrics module:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_Y, pred_Y)
While accuracy is the most commonly used classification metric, it is not suitable when the classes are imbalanced. For example, when predicting rare threats in an Intrusion Detection System, simply predicting that there is never a threat would give our classifier very high accuracy. But such a classifier is not meaningful, as it never detects a threat.
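As a quick, made-up illustration of this pitfall, a classifier that always predicts "no threat" on a dataset where only 1% of observations are threats still reaches 99% accuracy:
import numpy as np
from sklearn.metrics import accuracy_score

# 1,000 observations, only 10 of which are actual threats (label 1)
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)  # always predicts "no threat"

print(accuracy_score(y_true, y_pred))  # 0.99, yet no threat is ever detected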
Precision is another widely used metric; it measures how reliable the classifier's positive predictions are.
You can understand precision through the following question: Out of all the observations that were predicted to be positive, what proportion was actually positive?
Numerically, precision is defined as follows:
Precision = TP / (TP + FP)
We can observe from the formula that precision penalizes false positives. For example, if we want to determine whether someone is guilty of a crime, we would want fewer false positives as we do not want to convict an innocent person. Therefore, precision would be a major evaluation metric in this scenario.
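As a tiny worked example with made-up counts, 90 true positives and 10 false positives give a precision of 0.9:
# Hypothetical counts, not produced by a real model
TP, FP = 90, 10
precision = TP / (TP + FP)  # 0.9: 90% of the people we "convicted" were actually guilty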
You can calculate the precision score of your model through scikit-learn’s metrics module:
from sklearn.metrics import precision_score
precision = precision_score(test_Y, pred_Y)
Recall measures the proportion of actual positives that are correctly classified.
You can understand recall through the following question: Out of all observations that were actually positive, what proportion was predicted as positive?
Numerically, recall is defined as follows:
Recall = TP / (TP + FN)
Recall determines how good our classifier is at detecting positives. We can observe from the formula that recall penalizes false negatives. For example, if we want to determine if a person has a major disease like cancer, we want to have fewer false negatives as we do not want to leave a person with cancer untreated. Therefore, recall would be a major evaluation metric in this scenario.
You can calculate the recall score of your model through scikit-learn’s metrics module:
from sklearn.metrics import recall_score
recall = recall_score(test_Y, pred_Y)
NOTE: Recall is also referred to as True Positive Rate (TPR).
Regardless of the input, we can achieve 100% recall by making our classifier always output 1. There would be no false negatives, but many false positives, and hence low precision. This shows that there is a certain tradeoff between these two metrics. Ideally, we would want both recall and precision to be 100%, but that is rarely the case, and we may have to give priority to one metric over the other.
To prioritize one metric over the other, we can adjust the classification threshold according to our problem: a higher threshold generally increases precision at the expense of recall, while a lower threshold does the opposite.
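As a sketch of this (reusing the positive-class probabilities scores_Y from the setup above), scikit-learn's precision_recall_curve shows how precision and recall move across thresholds, and you can apply a custom threshold yourself:
from sklearn.metrics import precision_recall_curve

# Precision and recall at every candidate threshold
precisions, recalls, thresholds = precision_recall_curve(test_Y, scores_Y)

# Apply a stricter-than-default threshold to favor precision over recall
pred_Y_strict = (scores_Y >= 0.8).astype(int)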
Suppose we want a classifier that penalizes false positives and false negatives equally. Then the F1 score is the evaluation metric to consider. Numerically, the F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
We can observe that the F1 score maintains a balance between precision and recall. For example, if we want an Intrusion Detection System that is both good at detecting threats (high recall) and does not raise false alarms (high precision), the F1 score is the metric to prioritize.
You can calculate the F1 score of your model through scikit-learn’s metrics module:
from sklearn.metrics import f1_score
f1 = f1_score(test_Y, pred_Y)
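If you want precision, recall, and the F1 score for every class in one place, scikit-learn's classification_report prints a combined summary:
from sklearn.metrics import classification_report

print(classification_report(test_Y, pred_Y))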
The False Positive Rate (FPR) is the proportion of actual negatives that are misclassified as positive. Numerically, FPR is defined as follows:
FPR = FP / (FP + TN)
You can think of False Positive Rate through the following question: What proportion of innocent people did I convict?
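scikit-learn does not provide a standalone FPR function, but for a binary problem you can derive it from the confusion matrix:
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(test_Y, pred_Y).ravel()
fpr = fp / (fp + tn)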
AUC stands for the Area Under the ROC curve. A Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate at different classification thresholds; increasing the threshold decreases both TPR and FPR.
We see a tradeoff here as well. Ideally, we want a TPR of 1 and an FPR of 0. We want the ROC curve to be as close to the top left of the graph as possible. We can compute the Area Under the ROC curve (AUC), which determines how well the positive class is separated from the negative class.
You can calculate the AUC score of your model through scikit-learn’s metrics module:
from sklearn.metrics import roc_auc_score

# roc_auc_score expects positive-class scores or probabilities (e.g. from predict_proba), not hard 0/1 labels
auc = roc_auc_score(test_Y, scores_Y)
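To inspect the curve itself (again using the scores_Y probabilities from the setup sketch), roc_curve returns the FPR and TPR at each threshold, which you can plot directly:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(test_Y, scores_Y)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()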