A classifier is only as good as the metric used to evaluate it. If you choose the wrong metric to evaluate your models, you are likely to choose a poor model.
In this tutorial, you will discover metrics that you can use for imbalanced classification. After completing this tutorial, you will know:
- About the challenge of choosing metrics for classification, and how it is particularly difficult when there is a skewed class distribution.
- How there are three main types of metrics for evaluating classifier models, referred to as rank, threshold, and probability.
- How to choose a metric for imbalanced classification if you don’t know where to start.
This tutorial is divided into three parts; they are:
- Challenge of Evaluation Metrics
- Taxonomy of Classifier Evaluation Metrics
- How to Choose an Evaluation Metric
A. Challenge of Evaluation Metrics
There are standard metrics that are widely used for evaluating classification predictive models, such as classification accuracy or classification error.
In the case of class imbalances, the problem is even more acute because the default, relatively robust procedures used for unskewed data can break down miserably when the data is skewed.
B. Taxonomy of Classifier Evaluation Metrics
We can divide evaluation metrics into three useful groups; they are:
- Threshold Metrics: accuracy and F-measure
- Ranking Metrics: receiver operating characteristics (ROC) and AUC
- Probability Metrics: root-mean-squared error
1. Threshold Metrics for Imbalanced Classification
Perhaps the most widely used threshold metric is classification accuracy:
Accuracy = Correct Predictions/Total Predictions
And the complement of classification accuracy called classification error:
Error = Incorrect Predictions/Total Predictions
Although widely used, classification accuracy is almost universally inappropriate for imbalanced classification.
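To see why, here is a minimal sketch; the 1:100 class distribution, the model that only ever predicts the majority class, and the use of scikit-learn's accuracy_score are assumptions for illustration. The naive model still scores 99 percent accuracy.

```python
# Sketch: accuracy is misleading on imbalanced data.
from sklearn.metrics import accuracy_score

# 990 majority-class (0) examples and 10 minority-class (1) examples
y_true = [0] * 990 + [1] * 10

# A naive model that always predicts the majority class
y_pred = [0] * 1000

accuracy = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % accuracy)      # 0.990
print('Error: %.3f' % (1 - accuracy))   # 0.010
```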
There are two groups of metrics that may be useful for imbalanced classification because they focus on one class; they are sensitivity-specificity and precision-recall.
Sensitivity = TruePositive/(TruePositive + FalseNegative)
Specificity = TrueNegative/(FalsePositive + TrueNegative)
For imbalanced classification, the sensitivity might be more interesting than the specificity. Sensitivity and Specificity can be combined into a single score that balances both concerns, called the G-mean.
G-mean = sqrt(Sensitivity × Specificity)
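As a minimal sketch (the labels and predictions are made up for illustration), sensitivity, specificity, and the G-mean can be computed from a confusion matrix with scikit-learn:

```python
# Sketch: sensitivity, specificity, and G-mean from a confusion matrix.
from math import sqrt
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # also called recall or true positive rate
specificity = tn / (tn + fp)   # true negative rate
g_mean = sqrt(sensitivity * specificity)

print('Sensitivity: %.3f' % sensitivity)
print('Specificity: %.3f' % specificity)
print('G-mean: %.3f' % g_mean)
```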
The second group of metrics, precision and recall, is defined as:
Precision = TruePositive/(TruePositive + FalsePositive)
Recall = TruePositive/(TruePositive + FalseNegative)
Precision and recall can be combined into a single score, the F-measure, a popular metric for imbalanced classification:
F-measure = 2 × Precision × Recall/(Precision + Recall)
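A minimal sketch of these metrics with scikit-learn (labels and predictions again made up for illustration); fbeta_score is included because the F2 and F0.5 variants appear in the decision tree later in the tutorial:

```python
# Sketch: precision, recall, and F-measure with scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

print('Precision: %.3f' % precision_score(y_true, y_pred))
print('Recall: %.3f' % recall_score(y_true, y_pred))
print('F-measure: %.3f' % f1_score(y_true, y_pred))

# F2 weights recall more heavily, F0.5 weights precision more heavily
print('F2: %.3f' % fbeta_score(y_true, y_pred, beta=2))
print('F0.5: %.3f' % fbeta_score(y_true, y_pred, beta=0.5))
```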
2. Ranking Metrics for Imbalanced Classification
Ranking metrics evaluate how well a classifier separates the two classes based on its predicted scores or probabilities, rather than on crisp class labels. The most widely used is the receiver operating characteristic (ROC) curve, summarized by the area under the curve (ROC AUC); for imbalanced classification, the precision-recall curve and its area (PR AUC) focus attention on the minority class.
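A minimal sketch of computing ROC AUC and precision-recall AUC with scikit-learn; the synthetic 1:99 dataset, logistic regression model, and train/test split are assumptions for illustration:

```python
# Sketch: ROC AUC and precision-recall AUC on a synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# 1:99 imbalanced binary classification problem (assumed for illustration)
X, y = make_classification(n_samples=10000, n_classes=2, weights=[0.99, 0.01],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    stratify=y, random_state=1)

# Fit a model and predict probabilities for the positive class
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# ROC AUC considers both classes; PR AUC focuses on the positive (minority) class
print('ROC AUC: %.3f' % roc_auc_score(y_test, probs))
precision, recall, _ = precision_recall_curve(y_test, probs)
print('PR AUC: %.3f' % auc(recall, precision))
```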
3. Probabilistic Metrics for Imbalanced Classification
Probabilistic metrics are designed specifically to quantify the uncertainty in a classifier’s predictions.
Evaluating a model based on the predicted probabilities requires that the probabilities are calibrated.
Some classifiers are trained using a probabilistic framework, such as maximum likelihood estimation, meaning that their probabilities are already calibrated.
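For models whose scores are not calibrated probabilities (a support vector machine, for example), one option is to wrap them in scikit-learn's CalibratedClassifierCV before using a probabilistic metric. A minimal sketch, where the dataset, the SVM, and the calibration settings are assumptions for illustration:

```python
# Sketch: calibrating the scores of an SVM before using a probabilistic metric.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# Wrap an uncalibrated model so predict_proba returns calibrated probabilities
calibrated = CalibratedClassifierCV(SVC(), method='sigmoid', cv=3)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
print(probs[:5])
```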
Perhaps the most common metric for evaluating predicted probabilities is the log loss (or negative log likelihood) for binary classification, known more generally as cross-entropy.
For a binary classification dataset where the expected values are y and the predicted values are yhat, this can be calculated as follows:
LogLoss = −((1 − y) × log(1 − yhat) + y × log(yhat))
The score can be generalized to multiple classes by summing the log loss term for each class:
LogLoss = −sum c in C (y_c × log(yhat_c))
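A minimal sketch comparing the formula above against scikit-learn's log_loss, which averages the per-example terms; the labels and probabilities are made up for illustration:

```python
# Sketch: binary log loss, by the formula and with scikit-learn.
from math import log
from sklearn.metrics import log_loss

y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.8, 0.9]  # predicted probability of the positive class

# Average of the per-example log loss terms
manual = -sum((1 - y) * log(1 - p) + y * log(p)
              for y, p in zip(y_true, y_prob)) / len(y_true)

print('Manual log loss: %.3f' % manual)
print('sklearn log loss: %.3f' % log_loss(y_true, y_prob))
```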
Another popular score for predicted probabilities is the Brier score, calculated as the mean squared error between the predicted probabilities and the expected values:
BrierScore = 1/N × sum i=1 to N (yhat_i − y_i)^2
The benefit of the Brier score is that it is focused on the positive class, which for imbalanced classification is the minority class.
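A minimal sketch of the Brier score with scikit-learn, together with the Brier Skill Score mentioned in the next section; the no-skill reference model that always predicts the base rate, and the labels and probabilities, are assumptions for illustration:

```python
# Sketch: Brier score and Brier Skill Score (BSS) relative to a no-skill model.
from sklearn.metrics import brier_score_loss

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.7, 0.8]

# Brier score of the model's predicted probabilities
bs_model = brier_score_loss(y_true, y_prob)

# Reference: always predict the base rate of the positive class (0.2 here)
base_rate = sum(y_true) / len(y_true)
bs_ref = brier_score_loss(y_true, [base_rate] * len(y_true))

# Skill score: > 0 means better than the no-skill reference, 1.0 is perfect
bss = 1.0 - (bs_model / bs_ref)
print('Brier score: %.3f' % bs_model)
print('Brier skill score: %.3f' % bss)
```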
C. How to Choose an Evaluation Metric
Given that choosing an evaluation metric is so important and there are tens or perhaps hundreds of metrics to choose from, what are you supposed to do?
Are you predicting probabilities?
- Do you need class labels?
  - Is the positive class more important?
    - Use Precision-Recall AUC
  - Are both classes important?
    - Use ROC AUC
- Do you need probabilities?
  - Use Brier Score and Brier Skill Score

Are you predicting class labels?
- Is the positive class more important?
  - Are False Negatives and False Positives Equally Important?
    - Use F1-measure
  - Are False Negatives More Important?
    - Use F2-measure
  - Are False Positives More Important?
    - Use F0.5-measure
- Are both classes important?
  - Do you have < 80%-90% Examples for the Majority Class?
    - Use Accuracy
  - Do you have > 80%-90% Examples for the Majority Class?
    - Use G-mean
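As an illustration only, the decision tree above can be written down as a small helper function; the function name, its arguments, and the simplified 80% threshold are hypothetical:

```python
# Sketch: the metric-selection decision tree above as a helper function.
# Function name, arguments, and the 80% cutoff are hypothetical.
def suggest_metric(predicting_probabilities, need_labels=True,
                   positive_class_more_important=True,
                   false_negatives_more_important=False,
                   false_positives_more_important=False,
                   majority_class_fraction=0.5):
    if predicting_probabilities:
        if need_labels:
            if positive_class_more_important:
                return 'Precision-Recall AUC'
            return 'ROC AUC'
        return 'Brier Score and Brier Skill Score'
    # predicting crisp class labels
    if positive_class_more_important:
        if false_negatives_more_important:
            return 'F2-measure'
        if false_positives_more_important:
            return 'F0.5-measure'
        return 'F1-measure'
    # both classes important; the tutorial's rough 80%-90% cutoff is
    # simplified to 80% here
    if majority_class_fraction < 0.8:
        return 'Accuracy'
    return 'G-mean'


# Example: labels needed, minority class matters most, false negatives costly
print(suggest_metric(predicting_probabilities=False,
                     false_negatives_more_important=True))  # F2-measure
```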