How do anchor boxes work?

Anchor boxes are predefined bounding boxes of various shapes and sizes that help a detector handle objects with different aspect ratios. During training, the model learns to adjust and refine each anchor's dimensions so that it closely matches a ground truth box. Let’s learn how they work in a pipeline.

Calculating the size of an anchor box

Picking anchors that represent our data well is extremely important because YOLO learns to make adjustments to these anchor boxes to predict a bounding box for an object. Here are the steps we need to follow to calculate the anchor box size:

  1. Get bounding boxes’ dimensions from the training data: Since we need to find out the height and width of the anchors, we first determine the height and width of all the bounding boxes in the training data.

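  2. Cluster the bounding boxes: We group the collected widths and heights into clusters, one per anchor box. How many anchors we need depends on how YOLO lays out its predictions, because YOLO employs a grid-based approach for object detection. To illustrate, in YOLOv3, an image of 416 × 416 dimensions is partitioned into three grids of sizes 13 × 13, 26 × 26, and 52 × 52.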
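As a rough illustration of the first step and of the grid sizes mentioned above, here is a minimal sketch. It assumes YOLO-style labels with normalized widths and heights; the `labels` list and the function names are hypothetical. The strides of 32, 16, and 8 are the downsampling factors that turn a 416 × 416 input into the three grids.

```python
# Hypothetical YOLO-style labels: (class_id, x_center, y_center, width, height),
# with all coordinates normalized to the range [0, 1].
labels = [
    (0, 0.50, 0.50, 0.20, 0.40),
    (1, 0.25, 0.30, 0.05, 0.10),
    (0, 0.70, 0.60, 0.60, 0.50),
]

IMG_SIZE = 416  # input resolution used in the YOLOv3 example


def box_dimensions(labels, img_size=IMG_SIZE):
    """Return (width, height) in pixels for every labeled box."""
    return [(w * img_size, h * img_size) for _, _, _, w, h in labels]


def grid_sizes(img_size=IMG_SIZE, strides=(32, 16, 8)):
    """Return the number of cells per side for each detection scale."""
    return [img_size // s for s in strides]


print(box_dimensions(labels))
print(grid_sizes())
```

The list of (width, height) pairs is the only input the clustering step needs; the box positions are irrelevant for choosing anchor shapes.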

Let’s consider that we have three anchor boxes for each grid cell. Because YOLO makes predictions at three scales (small, medium, and large), we have a total of nine anchor boxes (three boxes per scale).

Now, how are these nine anchors assigned to the three grids? The assignment depends on the size of the anchor boxes as follows:

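  1. The three largest anchor boxes are assigned to the grid with the largest cells (13 × 13 in the YOLOv3 example), which handles large objects.

  2. The three medium-sized anchor boxes are assigned to the intermediate grid (26 × 26).

  3. Conversely, the three smallest anchor boxes are allocated to the grid with the smallest cells (52 × 52), which handles small objects.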
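The assignment rule can be sketched in a few lines. This is an illustrative example, not code from any YOLO release: the function name `assign_anchors` is hypothetical, and the nine (width, height) pairs are the commonly cited default YOLOv3 anchors for a 416 × 416 input.

```python
# Commonly cited default YOLOv3 anchors as (width, height) in pixels.
ANCHORS = [
    (10, 13), (16, 30), (33, 23),
    (30, 61), (62, 45), (59, 119),
    (116, 90), (156, 198), (373, 326),
]


def assign_anchors(anchors, grids=(52, 26, 13)):
    """Split anchors, sorted by area, evenly across the grids.

    `grids` is ordered from the smallest cells (52 x 52 grid) to the
    largest cells (13 x 13 grid), so the smallest anchors come first.
    """
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    per_grid = len(ordered) // len(grids)
    return {
        grid: ordered[i * per_grid:(i + 1) * per_grid]
        for i, grid in enumerate(grids)
    }


for grid, group in assign_anchors(ANCHORS).items():
    print(f"{grid} x {grid} grid: {group}")
```

Sorting by area is one simple way to rank anchors by size; it is an assumption of this sketch rather than a rule stated by the source.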

Metrics used in k-means clustering

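To determine the initial anchor boxes, we use k-means clustering, which groups the bounding boxes according to their size and aspect ratios. Instead of using the Euclidean distance as a metric, this clustering employs IoU scores: the distance between a box and a cluster centroid is taken as 1 − IoU. With the Euclidean distance, large boxes would produce larger errors than small ones, whereas an IoU-based distance is independent of box size. The aim is to maximize IoU scores for more precise predictions. The centroids of the resulting clusters then become our initial anchor boxes, giving us a useful starting point for object detection.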
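The following is a minimal, standard-library sketch of k-means with a 1 − IoU distance. It is meant to illustrate the idea, not to reproduce any particular YOLO implementation; the function names, the fixed random seed, and the toy `boxes` data are all assumptions. Boxes are compared by shape only, as if they shared the same top-left corner.

```python
import random


def shape_iou(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs using distance = 1 - IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # start from k random boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest (1 - IoU) distance is the same as largest IoU.
            nearest = max(range(k), key=lambda i: shape_iou(box, centroids[i]))
            clusters[nearest].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])


# Toy data: three small boxes and three large boxes.
boxes = [(10, 12), (12, 10), (11, 11), (100, 120), (120, 100), (110, 110)]
print(kmeans_anchors(boxes, k=2))
```

Taking the mean width and height as the new centroid is the simplest update rule; some implementations use the median instead.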

Genetic evolution (GE) algorithm and the auto-anchor concept

YOLOv5 introduced a concept known as auto-anchor. This technique was designed to optimize anchor boxes further after they’re generated through k-means clustering. Here is how it works:

  1. Auto-anchor runs before the training process to assess whether the k-means generated anchor boxes are suitable for the given dataset.

  2. If the initial anchors are suboptimal, the algorithm computes and evolves new anchors, which are then automatically incorporated into the model.

  3. The k-means centroids generated in the previous steps are used as the initial conditions for the GE algorithm.

  4. The algorithm randomly adjusts one or more characteristics of the anchor boxes, such as the aspect ratio, thereby generating new candidate anchors.

  5. A fitness score is calculated for each new candidate, with the Complete Intersection over Union (CIoU) loss and Best Possible Recall (BPR) as the evaluation metrics. Candidates that score better than the current best anchors replace them.
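The mutate-and-select loop described in these steps can be sketched as follows. This is a deliberately simplified illustration and not the YOLOv5 code: the fitness function here is a stand-in (the mean IoU between each box and its best-matching anchor), and the names `evolve_anchors`, `sigma`, and `generations` are hypothetical.

```python
import random


def shape_iou(a, b):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def fitness(anchors, boxes):
    """Stand-in fitness: mean IoU of each box with its best anchor."""
    return sum(max(shape_iou(box, a) for a in anchors) for box in boxes) / len(boxes)


def evolve_anchors(anchors, boxes, generations=200, sigma=0.1, seed=0):
    """Randomly mutate anchors and keep a mutation only if fitness improves."""
    rng = random.Random(seed)
    best, best_fit = list(anchors), fitness(anchors, boxes)
    for _ in range(generations):
        # Mutation: scale every width and height by a random factor near 1.
        candidate = [
            (max(w * rng.gauss(1, sigma), 1.0), max(h * rng.gauss(1, sigma), 1.0))
            for w, h in best
        ]
        candidate_fit = fitness(candidate, boxes)
        if candidate_fit > best_fit:  # selection: keep only improvements
            best, best_fit = candidate, candidate_fit
    return best, best_fit


# Toy data; the starting anchors play the role of the k-means centroids.
boxes = [(10, 12), (14, 9), (30, 28), (100, 120), (130, 90), (110, 115)]
start = [(20, 20), (90, 90)]
evolved, score = evolve_anchors(start, boxes)
print(fitness(start, boxes), score)
```

Because a candidate is accepted only when it scores higher, the fitness of the returned anchors can never be lower than that of the starting anchors.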

Complete Intersection over Union (CIoU)

The Complete Intersection over Union (CIoU) metric is a refined variant of the conventional IoU metric, often employed in assessing the similarity between two bounding boxes during object detection tasks. While the traditional IoU metric focuses exclusively on the overlap between the predicted and ground truth bounding boxes, CIoU introduces more nuanced considerations.

Specifically, CIoU takes into account additional factors like the aspect ratio and the distance between the centers of the bounding boxes (both the anchor box and the ground truth box). This additional level of detail makes CIoU a more comprehensive and informative metric, providing a richer comparison of bounding boxes. As a result, CIoU is often preferred when a more thorough assessment of bounding box accuracy is needed.
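To make these factors concrete, here is a small sketch of the CIoU formula as it is commonly defined: the IoU, minus a normalized center-distance term, minus an aspect-ratio term. The function name and the corner-based box format `(x1, y1, x2, y2)` are assumptions made for this example. The corresponding loss is usually taken as 1 − CIoU.

```python
import math


def ciou(box_a, box_b):
    """Complete IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Plain IoU: overlap area divided by union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    iou = inter / (wa * ha + wb * hb - inter)

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0

    return iou - rho2 / c2 - alpha * v


print(ciou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes
print(ciou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, CIoU still distinguishes between boxes that are close together and boxes that are far apart.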

Best Possible Recall (BPR)

Recall, also known as sensitivity or true positive rate, is the proportion of actual positive instances that are correctly identified by a classifier. Best Possible Recall (BPR) applies this idea to a set of anchor boxes. Here are some of its features:

  • It measures the maximum recall that can be achieved by a model given a fixed number of anchors.

  • The idea is to generate anchor boxes achieving high recall with a minimum number of boxes.

  • A high recall is important because it means that the model can detect most of the objects. High recall may come with some false positives, but the model can be further improved to reduce them.
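One way to estimate BPR for a set of anchors is sketched below. It assumes that a ground truth box counts as recallable when its shape IoU with at least one anchor exceeds a threshold; both the IoU criterion and the 0.5 threshold are illustrative assumptions, and real implementations may use a different matching rule.

```python
def shape_iou(a, b):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def best_possible_recall(boxes, anchors, iou_thresh=0.5):
    """Fraction of boxes whose best-matching anchor exceeds the threshold."""
    matched = sum(
        1 for box in boxes
        if max(shape_iou(box, a) for a in anchors) > iou_thresh
    )
    return matched / len(boxes)


# Toy ground truth shapes as (width, height) in pixels.
boxes = [(10, 10), (100, 100), (50, 50), (400, 20)]

print(best_possible_recall(boxes, [(10, 10), (100, 100)]))
print(best_possible_recall(boxes, [(10, 10), (50, 50), (100, 100)]))
```

Adding a third anchor that matches the medium-sized box raises the estimate, which illustrates the trade-off between recall and the number of anchors.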
