Related Tags

data science
classification

# What are the most commonly used classification metrics?

Abdul Monum

Classification metrics evaluate and quantify the performance of a classifier. The most commonly used ones are described below.

## Confusion matrix

Figure: Confusion matrix

A confusion matrix tabulates the number of observations that are correctly and incorrectly classified, broken down by class. It shows how the model performs on each class and what kinds of errors it makes. For a binary classifier, the four quantities in the matrix are as follows:

• True Positives (TP): The observation is positive and is predicted as positive.
• True Negatives (TN): The observation is negative and is predicted as negative.
• False Positives (FP): The observation is negative but is predicted as positive. You can think of it as a false alarm.
• False Negatives (FN): The observation is positive but is predicted as negative. You can think of it as a missed detection.

You can compute the confusion matrix with scikit-learn’s metrics module and plot it as a seaborn heatmap, as shown below:

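```python
from sklearn.metrics import confusion_matrix
import seaborn as sns

# test_Y holds the true labels, pred_Y the predicted labels
cm = confusion_matrix(test_Y, pred_Y)
# Draw the counts as an annotated heatmap; fmt='d' prints them as integers
sns.heatmap(cm, annot=True, fmt='d')
```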
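
The four counts can also be read directly from the matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative names, not part of the example above) and relies on scikit-learn's layout for binary 0/1 labels, where rows are true classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels the matrix is laid out as
# [[TN, FP],
#  [FN, TP]], so ravel() yields the counts in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Every metric in the rest of this article can be derived from these four counts.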


## Accuracy

Accuracy is the most widely used classification metric. It is the proportion of observations that are classified correctly and is defined as follows:

$accuracy=\frac{TP+TN}{n}$

Here, $n=TP+TN+FP+FN$ is the total number of observations.

You can understand accuracy through the following question: Out of all the predictions, how many did we predict correctly?

You can calculate the accuracy score of your model through scikit-learn’s metrics module:

```python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_Y, pred_Y)
```


While accuracy is the most commonly used classification metric, it is misleading when the classes are imbalanced. For example, when predicting rare threats in an Intrusion Detection System, a classifier that always predicts "no threat" achieves very high accuracy simply because threats are rare. Such a classifier is useless, as it never detects a threat.
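
The effect is easy to reproduce. The following is a minimal sketch with a hypothetical test set of 1,000 events, 10 of which are threats; the numbers are invented for illustration:

```python
# Hypothetical imbalanced test set: 990 normal events (0) and 10 threats (1).
y_true = [0] * 990 + [1] * 10

# A "classifier" that never predicts a threat.
y_pred = [0] * 1000

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Threats that were actually detected (true positives).
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"accuracy = {accuracy:.2f}")
print(f"threats detected = {detected} of {sum(y_true)}")
```

The accuracy is 99% even though not a single threat is detected, which is why the metrics in the following sections look at the positive class specifically.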

## Precision

Precision is also a widely used metric. It measures how trustworthy the classifier's positive predictions are.

You can understand precision through the following question: Out of all the observations that were predicted to be positive, what proportion was actually positive?

Numerically, precision is defined as follows: $precision=\frac{TP}{TP+FP}$

We can observe from the formula that precision penalizes false positives. For example, if we want to determine whether someone is guilty of a crime, we would want fewer false positives as we do not want to convict an innocent person. Therefore, precision would be a major evaluation metric in this scenario.

You can calculate the precision score of your model through scikit-learn’s metrics module:

```python
from sklearn.metrics import precision_score
precision = precision_score(test_Y, pred_Y)
```
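
To connect the formula to the library call, the sketch below counts TP and FP by hand on made-up labels (`y_true` and `y_pred` are illustrative names) and compares the result with `precision_score`:

```python
from sklearn.metrics import precision_score

# Toy labels, made up for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# Count the two quantities that appear in the precision formula.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

manual_precision = tp / (tp + fp)
sk_precision = precision_score(y_true, y_pred)

print(f"manual = {manual_precision:.2f}, scikit-learn = {sk_precision:.2f}")
```

Four observations are predicted positive and two of them really are positive, so both values come out to 0.5.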


## Recall

Recall is the proportion of actual positives that are correctly classified as positive.

You can understand recall through the following question: Out of all observations that were actually positive, what proportion was predicted as positive?

Numerically, recall is defined as follows:

$recall=\frac{TP}{TP+FN}$

Recall determines how good our classifier is at detecting positives. We can observe from the formula that recall penalizes false negatives. For example, if we want to determine if a person has a major disease like cancer, we want to have fewer false negatives as we do not want to leave a person with cancer untreated. Therefore, recall would be a major evaluation metric in this scenario.

You can calculate the recall score of your model through scikit-learn’s metrics module:

```python
from sklearn.metrics import recall_score
recall = recall_score(test_Y, pred_Y)
```


NOTE: Recall is also referred to as True Positive Rate (TPR).
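
Recall and precision can differ sharply on the same predictions. The sketch below uses invented screening results (the labels and variable names are assumptions for illustration) for a cautious classifier that flags only the cases it is very sure about:

```python
from sklearn.metrics import precision_score, recall_score

# Toy screening results, made up for illustration (1 = has the disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 2

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Every flagged patient is truly ill, so precision is perfect, yet half of the ill patients are missed, which recall exposes.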

### Tradeoff between precision and recall

Regardless of the input, we can achieve 100% recall by making our classifier always output 1. There would be no false negatives, but many false positives, and hence low precision. This shows that there is a certain tradeoff between these two metrics. Ideally, we would want both recall and precision to be 100%, but that is rarely the case, and we may have to give priority to one metric over the other.

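For that, we can adjust the classification threshold, the score above which an observation is labeled positive, according to our problem:

• Higher threshold: Fewer false positives and thus typically higher precision, at the cost of more false negatives
• Lower threshold: Fewer false negatives and thus higher recall, at the cost of more false positives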
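
The tradeoff can be seen by sweeping the threshold over a fixed set of scores. This is a minimal sketch in plain Python; the scores, labels, and the helper `precision_recall_at` are hypothetical and exist only for this illustration:

```python
# Hypothetical predicted probabilities for the positive class and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
y_true = [1, 1, 0, 1, 1, 0, 0, 0]

def precision_recall_at(threshold):
    """Label scores >= threshold as positive and return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Convention chosen here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (0.0, 0.5, 0.8):
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

A threshold of 0.0 labels everything positive and gives 100% recall with 50% precision, while raising the threshold to 0.8 gives 100% precision with 50% recall.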

## F1 Score

Suppose we want a metric that penalizes both false positives and false negatives, giving precision and recall equal weight. Then, the F1 score is the evaluation metric to consider. Numerically, the F1 score is the harmonic mean of precision and recall:

$F1=2\times\frac{precision\times recall}{precision+recall}$

We can observe that the F1 score maintains a balance between precision and recall: it is high only when both are high. For example, if we want an Intrusion Detection System that is both good at detecting threats (recall) and does not raise false alarms (precision), the F1 score would be a major evaluation metric.

You can calculate the F1 score of your model through scikit-learn’s metrics module:

```python
from sklearn.metrics import f1_score
f1 = f1_score(test_Y, pred_Y)
```
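
Plugging numbers into the formula shows why a lopsided classifier scores poorly. The helper `f1` and the precision and recall values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def f1(precision, recall):
    """F1 from precision and recall (returns 0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.8, 0.8)   # both metrics are good
lopsided = f1(1.0, 0.1)   # perfect precision, very poor recall

print(f"balanced: {balanced:.2f}")
print(f"lopsided: {lopsided:.2f}")
```

A plain average of 1.0 and 0.1 would be 0.55, but the F1 score for that pair is about 0.18, because it is pulled toward the weaker of the two metrics.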


## False Positive Rate (FPR)

False Positive Rate is the proportion of actual negatives that are misclassified as positive. Numerically, FPR is defined as follows:

$FPR=\frac{FP}{FP+TN}$

You can think of False Positive Rate through the following question: What proportion of innocent people did I convict?
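
The counts for FPR can be taken by hand, as in this minimal sketch. The verdicts are invented for illustration, with 0 standing for innocent and 1 for guilty:

```python
# Toy verdicts, made up for illustration: 0 = innocent, 1 = guilty.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]  # 1 = convicted

fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # innocent but convicted
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # innocent and acquitted

# The denominator counts only the actual negatives (innocent people).
fpr = fp / (fp + tn)
print(f"FPR = {fpr:.2f}")
```

One of the five innocent people is convicted, so the FPR is 0.2; the three guilty people do not enter the calculation at all.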

## ROC Curve and AUC

Figure: ROC curve

A Receiver Operating Characteristic (ROC) curve plots True Positive Rate against False Positive Rate, with one point for each classification threshold. AUC is the Area Under the ROC Curve. Increasing the classification threshold means both TPR and FPR decrease or stay the same.

• Decreased TPR means detecting fewer positives.
• Decreased FPR means fewer false positives.

We see a tradeoff here as well. Ideally, we want a TPR of 1 and an FPR of 0. We want the ROC curve to be as close to the top left of the graph as possible. We can compute the Area Under the ROC curve (AUC), which determines how well the positive class is separated from the negative class.

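• Best possible AUC is $1$.
• An AUC of $0.5$ corresponds to random guessing. Classifiers that predict randomly have AUC around 0.5 because they do not know the separation between positive and negative classes.
• An AUC below 0.5 is possible and means the classifier ranks observations worse than random; inverting its predictions would give an AUC above 0.5.
• A useful classifier would have an AUC between 0.5 and 1.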

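You can calculate the AUC score of your model through scikit-learn’s metrics module. Pass predicted scores or probabilities for the positive class rather than hard 0/1 labels, since the ROC curve is traced by varying the threshold over those scores:

```python
from sklearn.metrics import roc_auc_score
# prob_Y (hypothetical name): predicted probabilities for the positive class, not 0/1 labels
auc = roc_auc_score(test_Y, prob_Y)
```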
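
The points of the ROC curve themselves are available from `roc_curve`. The sketch below uses four made-up observations (`y_true` and `scores` are illustrative names) and also computes AUC by hand as a ranking measure, which is one way to read "how well the positive class is separated from the negative class":

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data, made up for illustration: true labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) pair per candidate threshold; these are the points of the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# AUC read as a ranking measure: the fraction of (positive, negative) pairs
# in which the positive observation gets the higher score (no ties here).
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(f"AUC = {auc:.2f}, pairwise ranking = {pairwise:.2f}")
```

Three of the four positive-negative pairs are ranked correctly, and both computations give 0.75.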

