Object Detection

Explore object detection fundamentals, including bounding boxes and confidence scores. Understand classical methods, CNN detectors like Faster R-CNN and YOLO, and transformer-based models such as DETR, DINO, and Grounding DINO. Gain hands-on experience using Hugging Face pipelines to run object detection on images efficiently.

We'll cover the following...

From classical computer vision to deep learning
- Two-stage vs. one-stage object detectors
The DETR revolution
Direct vs. indirect transformer usage
- 1. Direct transformers
- 2. Indirect transformers
Modern transformer detectors
Object detection with Hugging Face pipelines
Benchmark datasets
Where detection meets segmentation
Try it yourself
Summary

Object detection is a core computer vision task that enables machines to identify what objects appear in an image and where they are located. Unlike image classification, which assigns a single label to an entire image, object detection produces a set of detected objects, each with a class label, a bounding box, and a confidence score.

This capability is foundational to many real-world applications, including:

Autonomous driving and traffic monitoring
Retail shelf analytics
Medical imaging and diagnostics
Industrial inspection and robotics

From classical computer vision to deep learning

Before deep learning, object detection relied on manually crafted features.

Techniques such as Haar Cascades and Histogram of Oriented Gradients (HOG) searched for edges, textures, and patterns defined by humans. While these methods were fast, they were fragile: their performance dropped sharply when objects appeared under different lighting conditions, angles, or backgrounds.

CNN-based detectors transformed this approach. Convolutional layers automatically learn relevant features directly from data, shifting image understanding from manual feature engineering to end-to-end learning.

Two-stage vs. one-stage object detectors

Academic research and industry quickly converged on two families of architectures:

Two-stage models (R-CNN → Fast R-CNN → Faster R-CNN):
- They first generate region proposals, then classify them. This two-step process makes them highly accurate and reliable for medical imaging, satellite data, and scientific analysis, where missing an object can be costly.
One-stage models (SSD, YOLO):
- They skip proposals and predict boxes and labels in a single pass. This makes them fast enough for real-time use, ideal for drones, robotics, traffic cameras, and mobile apps.

Fun fact: YOLOv1 (2015) was trained on a single consumer GPU and still ran in real time; this achievement helped kick-start modern real-world applications of computer vision.

The detectors from this era, which spanned from 2015 to 2020, remain the backbone of many enterprise systems today.
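
Detectors in both families typically emit many overlapping candidate boxes for the same object and prune them with non-maximum suppression (NMS). The following is a minimal pure-Python sketch of greedy NMS, assuming boxes in (xmin, ymin, xmax, ymax) format. The function names and the 0.5 overlap threshold are illustrative choices, not taken from any particular detector, and real implementations usually apply this per class.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes around one object, plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # the lower-scoring duplicate (index 1) is suppressed
```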

The DETR revolution

In 2020, DEtection TRansformer (DETR) introduced a radical idea: treat detection as set prediction, not box proposal generation. Instead of searching across thousands of anchor boxes or candidate regions, DETR uses:

A backbone (CNN or Vision Transformer) to extract features.
A transformer encoder to model relationships across the entire image.
A transformer decoder with object queries that directly predict bounding boxes and labels.
A Hungarian matching step to pair predicted objects with ground truth.

There are no anchors, no proposal networks, and no non-max suppression. The result is a simpler training objective, more robust predictions, and better modeling of object context.
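
The matching step can be illustrated on a toy example. The sketch below finds the lowest-cost one-to-one pairing between predictions and ground-truth objects by brute force over permutations; the Hungarian algorithm returns the same optimal assignment in polynomial time. The cost used here (a class term plus an L1 box distance) is a simplified assumption rather than DETR's exact matching cost, and all names are hypothetical.

```python
from itertools import permutations


def l1(a, b):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target):
    """Simplified cost: low when the class agrees with high confidence and boxes are close."""
    class_cost = 1.0 - pred["score"] if pred["label"] == target["label"] else 1.0
    return class_cost + l1(pred["box"], target["box"])


def best_assignment(preds, targets):
    """Return (assignment, cost); assignment[t] is the prediction index matched to target t."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Boxes are normalized (cx, cy, w, h); values are made up for illustration.
preds = [
    {"label": "dog", "score": 0.8, "box": (0.52, 0.48, 0.30, 0.40)},
    {"label": "cat", "score": 0.9, "box": (0.20, 0.25, 0.10, 0.15)},
    {"label": "cat", "score": 0.3, "box": (0.80, 0.80, 0.10, 0.10)},
]
targets = [
    {"label": "cat", "box": (0.20, 0.20, 0.10, 0.20)},
    {"label": "dog", "box": (0.50, 0.50, 0.30, 0.40)},
]
assignment, cost = best_assignment(preds, targets)
# Predictions left unmatched would be trained to predict "no object".
unmatched = [i for i in range(len(preds)) if i not in assignment]
print(assignment, unmatched)
```

Because every ground-truth object is paired with exactly one prediction, duplicate predictions are penalized during training, which is why no separate suppression step is needed at inference time.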

Fun fact: DETR often looks “bad” early in training but improves markedly after many epochs. This slow convergence reflects how long the transformer takes to learn the global structure of images.

Direct vs. indirect transformer usage

In computer vision, transformers can operate in two fundamentally different ways depending on the task.

1. Direct transformers

In this setup, the model receives the image directly as a sequence of tokens. The image is divided into fixed-size patches (e.g., 16×16), each patch is embedded into a vector, and these vectors form the input sequence. The transformer then reasons over the entire image at once.
This approach is ideal for image-level tasks, such as classification, where the goal is simply to answer:

What is in this image?

Because the model perceives the entire image within a unified context, Vision Transformers (ViTs) excel at recognizing global patterns, such as object categories, textures, and abstract shapes.
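
The patch-splitting step described above can be sketched in a few lines of plain Python. This is an illustration only: it assumes a single-channel 224×224 image (a common but not universal input size) and omits the learned linear projection and position embeddings that a real ViT applies to each patch. The function name is hypothetical.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block in row-major order.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# Synthetic image whose pixel value encodes its position.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # number of patches, values per patch
```

With these sizes the image becomes a sequence of 14 × 14 = 196 tokens, each holding 16 × 16 = 256 values.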

2. Indirect transformers

Indirect transformer models still rely on transformers, but they use them after extracting local visual features.
A CNN or ViT backbone first converts the image into a feature map.
Then a transformer decoder receives a set of learned “object queries” and predicts:

object classes
bounding box coordinates

Instead of scanning thousands of anchor boxes, these models treat detection as set prediction, reducing complexity and improving semantic reasoning.

This approach answers:

What is in the image, and where is it?

Indirect transformer systems are used for object detection, counting, and localization, where spatial relationships are crucial.

Fun fact: DETR was so radically different from classic detectors that many researchers initially thought it wouldn’t scale. Later versions like DINO and Grounding DINO proved the opposite.

Modern transformer detectors

Research in object detection did not stop with DETR. Several new and improved transformer-based detectors address limitations in training speed, generalization, and flexibility:

DINO (DETR with Improved deNoising anchOr boxes) improves object queries and adds denoising training, resulting in faster convergence and higher accuracy. It is distinct from the self-supervised method of the same name, which uses self-distillation, a technique where a model learns from its own predictions to improve its accuracy and generalization.
Grounding DINO enables detection based on text prompts, such as “detect pedestrians wearing helmets.”
OWL-ViT supports open-vocabulary detection, allowing it to find objects described by text even if they were not included in the training labels.
YOLOS serves as a pure transformer baseline for detection, operating without a CNN backbone.
DETA offers more efficient training and faster convergence compared to earlier models.

All of these models are available on Hugging Face, making them ready for both inference and fine-tuning.

Object detection with Hugging Face pipelines

The object detection pipeline returns a list of detected objects, each containing:

label: The name of the detected object
score: The model’s confidence/probability
box: A bounding box dictionary with pixel corner coordinates: xmin, ymin, xmax, ymax
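
A minimal sketch of such a call and of reading its output is shown below. It assumes the transformers library with the "object-detection" pipeline task, whose results carry the box keys xmin, ymin, xmax, and ymax. The checkpoint name facebook/detr-resnet-50 and the 0.9 threshold are illustrative choices (that checkpoint may need extra packages such as timm), and the helper names are hypothetical. The sample detections are hand-written in the pipeline's output format, not real model output.

```python
def run_detection(image, model_name="facebook/detr-resnet-50", threshold=0.9):
    """Run the object-detection pipeline on an image path, URL, or PIL image."""
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    detector = pipeline("object-detection", model=model_name)
    return detector(image, threshold=threshold)


def describe(detections, min_score=0.5):
    """Format pipeline-style detections as readable lines, best first."""
    lines = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if det["score"] < min_score:
            continue
        box = det["box"]
        width = box["xmax"] - box["xmin"]
        height = box["ymax"] - box["ymin"]
        corner = f"({box['xmin']}, {box['ymin']})"
        lines.append(f"{det['label']}: {det['score']:.2f} at {corner}, {width}x{height} px")
    return lines


# Hand-written sample in the pipeline's output format (not real model output).
sample = [
    {"score": 0.42, "label": "bench",
     "box": {"xmin": 5, "ymin": 200, "xmax": 120, "ymax": 260}},
    {"score": 0.98, "label": "cat",
     "box": {"xmin": 40, "ymin": 60, "xmax": 300, "ymax": 380}},
]
for line in describe(sample):
    print(line)

# With transformers installed and the model downloaded, a real run would be:
# detections = run_detection("path/to/image.jpg")
```

Filtering by score after the call, as describe does, makes it easy to experiment with different confidence cut-offs without re-running the model.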

Fun fact: DEtection TRansformer (DETR) treats object detection as a direct set prediction problem, meaning it can detect all objects in one forward pass without relying on traditional anchor boxes.

What is happening here:

The pipeline automatically handles the necessary pre-processing and post-processing steps, such as:

Resizing the image
Normalization
Post-processing the bounding boxes

This allows you to focus on interpreting results rather than preparing the data.
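
To give a sense of what the box post-processing involves: DETR-style models predict each box as normalized (center x, center y, width, height) values between 0 and 1, which must be rescaled to pixel corner coordinates for the original image. The helper below is a hypothetical sketch of that conversion under those assumptions, not the library's own code.

```python
def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel corner coordinates."""
    cx, cy, w, h = box
    return {
        "xmin": round((cx - w / 2) * image_width),
        "ymin": round((cy - h / 2) * image_height),
        "xmax": round((cx + w / 2) * image_width),
        "ymax": round((cy + h / 2) * image_height),
    }


# A box centered in a 640x480 image, a quarter as wide and half as tall.
corners = to_pixel_corners((0.5, 0.5, 0.25, 0.5), 640, 480)
print(corners)
```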

Benchmark datasets

These datasets are widely used for evaluating and training object detection and segmentation models.

COCO (Common Objects in Context): COCO contains 80 object categories with real-world images and is widely used for benchmarking both object detection and segmentation tasks. Its focus on standardized evaluation makes it a go-to for comparing models.
Objects365: Objects365 has 365 classes and millions of annotations. It is popular for large-scale pretraining and helps models improve generalization across diverse object types.
Open Images: Open Images includes millions of images with hierarchical labels, bounding boxes, and segmentation masks. It is particularly useful for open-world scenarios where real-world complexity and noise are important.

Each dataset has a different personality:

COCO is standards-focused,
Objects365 is large and diverse, and
Open Images captures complex real-world noise.

Where detection meets segmentation

Bounding boxes give a rough idea of object locations, but many applications require precise, pixel-level understanding.

Instance segmentation: Identifies exactly which pixels belong to each individual object.
Semantic segmentation: Groups all pixels of the same class, regardless of the individual object.

Models like Mask R-CNN, Segment Anything, and Mask2Former bridge the gap between detection and segmentation, enabling applications such as surgical diagnostics, satellite mapping, and robotic grasping.
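
The difference between the two segmentation types can be seen on a toy label map. In the sketch below, the semantic map marks every "cat" pixel with the same class id, while the instance map gives each separate region its own id. Splitting instances by connected regions is a simplification used only for illustration: real instance segmentation models predict one mask per object and can separate objects that touch. All names are hypothetical.

```python
# Toy 4x6 semantic map: 0 = background, 1 = "cat".
semantic = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]


def label_instances(semantic_map, class_id):
    """Give each 4-connected region of class_id a distinct instance id (flood fill)."""
    rows, cols = len(semantic_map), len(semantic_map[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic_map[r][c] != class_id or instances[r][c]:
                continue
            next_id += 1
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                if not (0 <= y < rows and 0 <= x < cols):
                    continue
                if semantic_map[y][x] != class_id or instances[y][x]:
                    continue
                instances[y][x] = next_id
                stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return instances, next_id


instances, count = label_instances(semantic, class_id=1)
for row in instances:
    print(row)
print("cat instances:", count)
```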

Fun fact: Instance segmentation was first popularized with the COCO dataset, which provides pixel-level masks for thousands of objects.

Try it yourself

Now that you’ve learned the theory behind object detection, it’s time to see it in action. In this exercise, you will run a Jupyter Notebook containing code for detecting objects in images using Hugging Face pipelines. Simply execute the cells and observe the results.

Add the token value you have already created in the Text and Token Classification lesson to the first cell of the Jupyter Notebook, and then run all cells.
