
17/10/2021

Object Detection Metrics With Worked Example

Average Precision (AP) and mean Average Precision (mAP) are the most popular metrics used to evaluate object detection models such as Faster R-CNN, Mask R-CNN, and YOLO, among others. The same metrics are also used to evaluate submissions in competitions such as the COCO and PASCAL VOC challenges.

From AP, we can derive other metrics such as AP50, AP75 and AP@[.5:.05:.95] (see Fig 1 and Fig 2 below). In fact, mAP is derived from AP.

The following figures (Fig 1 and Fig 2) show the usage of these metrics in some state-of-the-art (SOTA) models.



Fig 1: Metrics used in: Left: Mask R-CNN, Right: YOLOv3


Fig 2: Some metrics used in the COCO challenge

At a low level, measuring the performance of an object detector involves determining whether a detection is valid or not.

Definitions:
  • True Positive (TP) — A valid detection.
  • False Positive (FP) — An invalid detection.
  • False Negative (FN) — Ground-truth missed by the model.
  • True Negative (TN) — This metric does not apply in object detection because there are infinitely many instances that should not be detected as objects.
In the context of determining the validity of a detection (predicted mask), a supporting metric called Intersection over Union (IoU, also known as the Jaccard Index) is needed.

Intersection over Union
In object detection problems, IoU evaluates the overlap between the ground-truth mask (gt) and the predicted mask (pd). It is calculated as the area of intersection between gt and pd divided by the area of their union, that is,

$$\mathrm{IoU} = \frac{\operatorname{area}(gt \cap pd)}{\operatorname{area}(gt \cup pd)}$$

Diagrammatically, IoU is defined as:

Fig 3: IoU


We can, therefore, redefine a TP (correct detection) as a detection with IoU ≥ α and an FP (invalid detection) as one with IoU < α. An FN is a ground truth missed by the model.
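For concreteness, here is a minimal Python sketch (not from the original post) of how IoU can be computed for two axis-aligned boxes and how a detection can then be classified as a TP or FP at a threshold α; the (x1, y1, x2, y2) box format and the example coordinates are assumptions.

```python
# Minimal sketch: IoU between two axis-aligned boxes given as (x1, y1, x2, y2),
# and TP/FP classification at an IoU threshold alpha.

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(gt_box, pd_box, alpha=0.5):
    """A detection is a TP if its IoU with the ground truth is at least alpha."""
    return iou(gt_box, pd_box) >= alpha


if __name__ == "__main__":
    gt = (10, 10, 50, 50)   # hypothetical ground-truth box
    pd = (20, 20, 60, 60)   # hypothetical predicted box
    print(f"IoU = {iou(gt, pd):.3f}, TP at alpha=0.5? {is_true_positive(gt, pd)}")
```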

For example, at an IoU threshold of α = 0.5 (or 50%), we can define TP, FP and FN as shown in Fig 4 below.


Fig 4: IoU and IoU threshold defined in a diagram

Note: If we raise the IoU threshold above 0.86, the first instance becomes an FP, and if we lower the IoU threshold below 0.24, the second instance becomes a TP.

Precision and Recall

As stated earlier, TNs are not used in object detection problems, and therefore one has to avoid metrics based on this confusion-matrix component, such as the True Negative Rate (TNR), False Positive Rate (FPR), Negative Predictive Value (NPV) and the Receiver Operating Characteristic (ROC) curve. Instead, the evaluation of object detection models is based on Precision (P) and Recall (R), which are defined as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

Precision is the ability of a classifier to identify only relevant objects. It is the proportion of detections that are true positives.

Recall, on the other hand, measures the ability of the model to find all relevant cases (that is, all ground-truths) — the proportion of true positives detected among all ground-truths.

A good model is one that can identify most ground-truth objects (high recall) while finding only relevant objects (high precision). A perfect model would have FN = 0 (recall = 1) and FP = 0 (precision = 1); in practice this is rarely attainable, and there is a trade-off between the two.
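As a tiny sketch of the two formulas above (the TP/FP/FN counts below are hypothetical, not taken from the post):

```python
# Precision and recall from TP, FP and FN counts.

def precision(tp: int, fp: int) -> float:
    """Proportion of detections that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    """Proportion of ground truths that were detected."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 7 TPs, 5 FPs, 2 missed ground truths.
tp, fp, fn = 7, 5, 2
print(f"precision = {precision(tp, fp):.3f}")  # 7 / 12 = 0.583
print(f"recall    = {recall(tp, fn):.3f}")     # 7 / 9  = 0.778
```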

Precision x Recall Curve (PR Curve)
The precision-recall (PR) curve is a plot of precision as a function of recall. It shows the trade-off between the two metrics as the confidence threshold on the model's detections varies. If few detections are accepted (high confidence threshold), false positives are low and precision is high, but more object instances are missed, yielding more FNs and hence low recall. Conversely, if more detections are accepted by lowering the confidence threshold, recall increases but false positives may also increase, decreasing precision. For a good model, both precision and recall should remain high even as the confidence threshold varies.

Average Precision
AP@α is, ideally, the area under the PR curve (AUC-PR). Mathematically, AP is defined as

$$AP = \int_0^1 p(r)\, dr$$

Notation: AP@α or APα means Average Precision (AP) at an IoU threshold of α. Therefore, AP50 and AP75 mean AP at IoU thresholds of 50% and 75% respectively.

A high AUC-PR implies both high precision and high recall. In practice, the PR curve is often a zigzag-like plot (not monotonically decreasing). We usually want to remove this behavior (make the curve monotonically decreasing) before calculating the area under the curve (the Average Precision). This is done using interpolation methods. We will discuss two of these interpolation methods below:
  • 11-point interpolation method
  • All-point interpolation approach
11-point interpolation method

An 11-point AP is the average of interpolated precision values for a model's results at 11 equally spaced standard recall levels, namely 0.0, 0.1, 0.2, . . . , 1.0. It is defined as

$$AP_{11} = \frac{1}{11} \sum_{r \in R} p_{\text{interp}}(r)$$

where $R = \{0.0, 0.1, 0.2, \ldots, 1.0\}$ and

$$p_{\text{interp}}(r) = \max_{r' \ge r} p(r')$$

That is, the interpolated precision at recall value $r$ is the highest precision for any recall value $r' \ge r$.
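The following Python sketch (not from the post) implements the 11-point formula above; the sample recall/precision arrays are made up and do not correspond to Table 1.

```python
import numpy as np

def ap_11_point(recalls, precisions):
    """11-point interpolated AP: average of the interpolated precision
    p_interp(r) = max{ p(r') : r' >= r } at r = 0.0, 0.1, ..., 1.0."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap

# Hypothetical PR points (already sorted by descending confidence).
recalls    = [0.111, 0.222, 0.222, 0.333, 0.333, 0.444]
precisions = [1.000, 1.000, 0.667, 0.750, 0.600, 0.667]
print(f"AP(11-point) = {ap_11_point(recalls, precisions):.3f}")
```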

All-point interpolation method

Unlike the 11-point method, all-point interpolation interpolates through all recall positions, that is,

$$AP_{\text{all}} = \sum_{n} (r_{n+1} - r_n)\, p_{\text{interp}}(r_{n+1}), \qquad p_{\text{interp}}(r_{n+1}) = \max_{r' \ge r_{n+1}} p(r')$$
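A minimal sketch of this formula is shown below (again not the post's code); it computes the precision envelope and sums the area over the recall intervals, using the same hypothetical PR points as in the 11-point sketch. Padding the curve with sentinel points at recall 0 and 1 is a common implementation choice, not something mandated by the formula.

```python
import numpy as np

def ap_all_point(recalls, precisions):
    """All-point interpolated AP: area under the precision envelope,
    summed over the intervals where recall changes."""
    # Sentinel points so the envelope and the recall intervals are
    # well defined at both ends of the curve.
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, dtype=float), [0.0]))

    # Precision envelope: p_interp(r) = max precision for any recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])

    # Sum (r_{n+1} - r_n) * p_interp(r_{n+1}) over points where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Same hypothetical PR points as in the 11-point sketch above.
recalls    = [0.111, 0.222, 0.222, 0.333, 0.333, 0.444]
precisions = [1.000, 1.000, 0.667, 0.750, 0.600, 0.667]
print(f"AP(all-point) = {ap_all_point(recalls, precisions):.3f}")
```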



Mean Average Precision (mAP)

Remark (AP and the number of classes): AP is calculated individually for each class. This means that there are as many AP values as there are classes. These AP values are averaged to obtain the mean Average Precision (mAP). In other words, mAP is the average of the AP values over all classes.

Remark (AP and IoU): As said earlier, AP is calculated at a given IoU threshold, α. By the same reasoning, AP can also be calculated over a range of thresholds. Microsoft COCO calculates the AP of a given category/class at 10 different IoU thresholds ranging from 50% to 95% with a step size of 5%, usually denoted AP@[.50:.05:.95]. Mask R-CNN reports the average of AP@[.50:.05:.95] simply as AP.
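The sketch below illustrates both remarks with made-up AP values (the class names and numbers are hypothetical): mAP is an average over classes, and COCO-style AP additionally averages over the 10 IoU thresholds.

```python
import numpy as np

# ap_per_class_per_iou[class][iou_threshold] -> AP value; the numbers are made up.
ap_per_class_per_iou = {
    "cat": {0.50: 0.72, 0.55: 0.70, 0.60: 0.66, 0.65: 0.61, 0.70: 0.55,
            0.75: 0.48, 0.80: 0.40, 0.85: 0.30, 0.90: 0.18, 0.95: 0.05},
    "dog": {0.50: 0.65, 0.55: 0.62, 0.60: 0.58, 0.65: 0.52, 0.70: 0.45,
            0.75: 0.38, 0.80: 0.29, 0.85: 0.20, 0.90: 0.10, 0.95: 0.02},
}

# mAP at a fixed IoU threshold (e.g. mAP@0.5): average AP over classes.
map_50 = np.mean([aps[0.50] for aps in ap_per_class_per_iou.values()])

# COCO-style AP: average over classes AND over the 10 IoU thresholds.
coco_ap = np.mean([np.mean(list(aps.values()))
                   for aps in ap_per_class_per_iou.values()])

print(f"mAP@0.5          = {map_50:.3f}")
print(f"AP@[.50:.05:.95] = {coco_ap:.3f}")
```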


Example

Consider Fig 5 below. It contains 3 images with 12 detections (red bounding boxes) and 9 ground truths (green boxes). Each detection is labelled with a letter and the confidence of the prediction. In this example, we consider all the ground truths to be of the same class (we are detecting objects of the same class) and use an IoU threshold of α = 50%. The IoU for each detection-truth pair is indicated in Table 1 below. The columns cumTP and cumFP are cumulative values of the TP and FP columns respectively; they accumulate the TP and FP values at and above the corresponding confidence level.


Fig 5: A model detecting objects of the same class. There are 12 detections and 9 ground-truths.


Remark (Multiple detections): In some cases, multiple detections overlap one ground truth, e.g. c, d in image 1, g, f in image 2 and i, k in image 3. In such cases, the detection with the highest confidence is considered a TP and the rest of the detections are FPs. This is, however, conditioned on the IoU threshold: the detection with the highest confidence must have IoU ≥ the threshold, otherwise all the detections are FPs. Applying this argument:
  • c and d become FPs because neither of them meets the threshold requirement: c and d have IoUs of 47% and 42% respectively against the required 50%.
  • g is a TP and f is an FP. Both have IoUs greater than 50%, but g has a higher confidence of 97% against the confidence of 96% for f.
  • What about i and k?

Multiple detections of the same object in an image are considered false detections, e.g. 5 detections of a single object count as 1 correct detection and 4 false detections. Some detectors can output multiple detections overlapping a single ground truth; in those cases the detection with the highest confidence is considered a TP and the others are considered FPs, as applied by the PASCAL VOC 2012 challenge.
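The sketch below (not the post's code; the function names, box format and data layout are assumptions) illustrates this greedy matching rule: detections are processed in descending order of confidence, each ground truth can be claimed at most once, and every duplicate or low-overlap detection becomes an FP.

```python
def iou(a, b):
    """IoU for boxes in (x1, y1, x2, y2) format (same as the earlier sketch)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, alpha=0.5):
    """Greedy PASCAL-VOC-style matching.
    detections: list of (confidence, box); gt_boxes: list of boxes.
    Returns a list of (confidence, is_tp) sorted by descending confidence."""
    results = []
    matched = set()  # indices of ground truths already claimed by a detection
    for conf, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best still-unmatched ground truth for this detection.
        best_iou, best_gt = 0.0, None
        for j, gt in enumerate(gt_boxes):
            overlap = iou(gt, box)
            if overlap > best_iou and j not in matched:
                best_iou, best_gt = overlap, j
        if best_gt is not None and best_iou >= alpha:
            matched.add(best_gt)
            results.append((conf, True))    # TP
        else:
            results.append((conf, False))   # FP (duplicate or poor overlap)
    return results
```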




Important: Before filling in cumTP, cumFP, all_detections, precision and recall, you need to sort the table rows by confidence in descending order. precision is cumTP/all_detections and recall is cumTP/number_of_ground_truths. We have 9 ground truths.
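The following sketch shows how these cumulative columns, precision and recall can be computed; the (confidence, is_tp) pairs below are made-up stand-ins for Table 1, not the actual values from the figure.

```python
import numpy as np

# Hypothetical stand-in for Table 1: (confidence, is_tp) per detection,
# e.g. as returned by match_detections() above.
detections = [(0.97, True), (0.96, False), (0.92, True), (0.88, True),
              (0.84, False), (0.80, True), (0.71, False), (0.65, True),
              (0.61, True), (0.54, False), (0.48, True), (0.42, False)]
num_ground_truths = 9

# Sort by confidence, descending, before accumulating.
detections.sort(key=lambda d: d[0], reverse=True)
tp = np.array([int(is_tp) for _, is_tp in detections])
fp = 1 - tp

cum_tp = np.cumsum(tp)                      # cumTP column
cum_fp = np.cumsum(fp)                      # cumFP column
all_detections = np.arange(1, len(tp) + 1)  # running count of detections

precision = cum_tp / all_detections
recall = cum_tp / num_ground_truths

for i, (conf, _) in enumerate(detections):
    print(f"conf={conf:.2f}  cumTP={cum_tp[i]}  cumFP={cum_fp[i]}"
          f"  P={precision[i]:.3f}  R={recall[i]:.3f}")
```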


11-point interpolation

To calculate the approximation of AP@0.5 using 11-point interpolation, we need to average the interpolated precision values at the recall values in R (see the 11-point formula above), that is, at recall values 0.0, 0.1, 0.2, . . . , 1.0, as shown in the right panel of Fig 6 below.


Fig 6: Left: plot of all precision-recall values. Right: the 11 precision values matching the recall values 0.0, 0.1, …, 1.0.


All-point interpolation

From the all-point definition above, we can calculate AP@50 using all-point interpolation as follows.


Fig 7: Left: the all-point interpolation curve (in red) overlaying the original PR curve. Right: the regions used for all-point interpolation.


Put simply, all-point interpolation involves calculating and summing the areas of the 4 regions (R1, R2, R3 and R4) in the right panel of Fig 7, that is,

$$AP@50 = A_{R1} + A_{R2} + A_{R3} + A_{R4}$$


Remark: Recall that AP is calculated for each class. Our calculation here is for one class. For several classes, we can calculate mAP by simply taking the average of the resulting AP values for the different classes.


Conclusion

You will also have noticed that IoU is an important concept because we cannot define any of these metrics without defining a threshold based on IoU. We have also learnt that AP is calculated per class and that the average of the resulting AP values is the mAP. AP can also be calculated at different IoU thresholds.

It is also important to note that the prediction mask may not necessarily be a rectangular box. It can take any shape, specifically, the shape of the object being detected.



Fig 8: Ground truth (blue circles) and detection (red masks). The predictions are not regular shaped masks.







