How Does YOLO Handle Multi-Scale Predictions?

In object detection, multi-scale prediction means detecting objects of various sizes within an image. This is crucial for detection accuracy because objects in real-world images appear at different scales depending on their physical size, their distance from the camera, and the viewing perspective.

YOLO’s grid division

YOLO approaches object detection by dividing the input image into a grid, such as 13 × 13 or 19 × 19 cells. Each grid cell is responsible for predicting the objects whose centers fall within that cell.
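As a minimal sketch of this assignment rule, the function below maps an object's center, given in pixels, to the grid cell that would be responsible for it. The function name `responsible_cell`, the pixel-coordinate convention, and the 416 × 416 input size are assumptions made for illustration; they are not taken from a specific YOLO implementation.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell that contains the point (cx, cy)."""
    col = int(cx / img_w * grid_size)
    row = int(cy / img_h * grid_size)
    # Clamp so a center lying exactly on the right or bottom edge stays in range.
    col = min(max(col, 0), grid_size - 1)
    row = min(max(row, 0), grid_size - 1)
    return row, col


# Assumed example: a 416 x 416 input divided into a 13 x 13 grid (32-pixel cells).
print(responsible_cell(cx=200, cy=120, img_w=416, img_h=416, grid_size=13))  # (3, 6)
```

With 32-pixel cells, a center at x = 200 lands in column 6 and a center at y = 120 lands in row 3, so cell (3, 6) would predict this object.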

Moreover, each cell in the grid predicts multiple bounding boxes along with confidence scores and class probabilities, and each bounding box is meant to describe only one object. For a given object, only the highest-confidence bounding box from the cell that contains the object’s center is considered.

However, if the centers of multiple objects fall within the same cell, the cell might struggle to predict all of them accurately.
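This limitation can be illustrated with a small sketch that groups object centers by grid cell and reports the cells holding more than one center. The helper name `crowded_cells` and the sample coordinates are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def crowded_cells(centers, img_size, grid_size):
    """Return the cells that contain two or more object centers."""
    cell_px = img_size / grid_size  # side length of one square cell, in pixels
    cells = defaultdict(list)
    for cx, cy in centers:
        row = min(int(cy // cell_px), grid_size - 1)
        col = min(int(cx // cell_px), grid_size - 1)
        cells[(row, col)].append((cx, cy))
    return {cell: pts for cell, pts in cells.items() if len(pts) > 1}


# Hypothetical centers in a 416 x 416 image with a 13 x 13 grid (32-pixel cells).
centers = [(100, 100), (110, 105), (300, 50)]
print(crowded_cells(centers, img_size=416, grid_size=13))
```

Here the first two centers both fall in cell (3, 3), so that single cell would have to account for two objects, while the third object has cell (1, 9) to itself.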

  • Anchor boxes: To handle the diverse shapes and sizes of objects, YOLO uses anchor boxes, which are predefined bounding box shapes.

  • Multiple bounding box predictions: Using the predefined shapes of the anchor boxes as references, each grid cell predicts multiple bounding boxes. This improves the model’s ability to detect objects across a range of sizes.
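One common way to relate an object to a set of anchor boxes is to compare shapes only, ignoring position, and pick the anchor with the highest intersection over union (IoU). The sketch below assumes that approach; the anchor sizes and the function names `shape_iou` and `best_anchor` are illustrative and not taken from a particular YOLO release.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, each given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(box_wh, anchors):
    """Return the index of the anchor whose shape best matches box_wh."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


# Illustrative anchors in pixels: small square, tall, wide.
anchors = [(30, 30), (60, 120), (150, 80)]
print(best_anchor((55, 130), anchors))  # 1: the tall anchor
print(best_anchor((140, 70), anchors))  # 2: the wide anchor
```

Because a tall object and a wide object match different anchors, two objects whose centers share a cell can still be described by separate boxes, provided their shapes differ enough.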
