Object Detection

Explore object detection fundamentals including bounding boxes and confidence scores. Understand classical methods, CNN detectors like Faster R-CNN and YOLO, and transformer-based models such as DETR, DINO, and Grounding-DINO. Gain hands-on experience using Hugging Face pipelines to run object detection on images efficiently.

Object detection is a core computer vision task that enables machines to identify what objects appear in an image and where they are located. Unlike image classification, which assigns a single label to an entire image, object detection produces a set of detected objects, each with a bounding box and a confidence score.

This capability is foundational to many real-world applications, including:

  • Autonomous driving and traffic monitoring

  • Retail shelf analytics

  • Medical imaging and diagnostics

  • Industrial inspection and robotics
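The output format described earlier (a set of detections, each with a label, a bounding box, and a confidence score) can be sketched as a small data structure. This is only an illustration: the `Detection` class, the `filter_by_score` helper, and all values below are hypothetical and not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected object: class label, confidence, and box corners in pixels."""
    label: str
    score: float  # confidence in [0, 1]
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        """Box area in square pixels (0 for degenerate boxes)."""
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)


def filter_by_score(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep detections with confidence >= threshold, most confident first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up values for illustration only.
detections = [
    Detection("car", 0.92, 30, 40, 210, 160),
    Detection("person", 0.35, 220, 50, 260, 170),
    Detection("traffic light", 0.78, 100, 5, 115, 40),
]
confident = filter_by_score(detections, threshold=0.5)
print([d.label for d in confident])
```

A classifier would return a single label for the whole image; here each object carries its own label, location, and score, and low-confidence entries can be dropped with a threshold.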

An example of object detection

From classical computer vision to deep learning

Before deep learning, object detection relied on manually crafted features.

Techniques such as Haar Cascades and Histogram of Oriented Gradients (HOG) searched for edges, textures, and patterns defined by humans. While these methods were fast, they were fragile: their performance dropped sharply when objects appeared under different lighting conditions, angles, or backgrounds.

CNN-based detectors transformed this approach. Convolutional layers automatically learn relevant features directly from data, shifting image understanding from manual feature engineering to end-to-end learning.
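To make "hand-crafted features" concrete, here is a heavily simplified sketch of the core idea behind HOG: a histogram of gradient orientations over a small grayscale patch. It assumes unsigned orientations in 9 bins and skips the bin interpolation and block normalization that full HOG implementations use; the function name is hypothetical.

```python
import math
from typing import List


def orientation_histogram(patch: List[List[float]], bins: int = 9) -> List[float]:
    """Histogram of unsigned gradient orientations (0 to 180 degrees),
    weighted by gradient magnitude, over the interior pixels of `patch`."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(patch), len(patch[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # horizontal intensity change
            gy = patch[r + 1][c] - patch[r - 1][c]  # vertical intensity change
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region: no edge evidence
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: dark on the left, bright on the right.
vertical_edge = [[0, 0, 1, 1] for _ in range(4)]
print(orientation_histogram(vertical_edge))
```

Every step here (the gradient filter, the bin count, the weighting) is a human design decision. A CNN instead learns its filters from training data.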

Two-stage vs. one-stage object detectors

Academic research and industry quickly converged on two families of architectures:

  • Two-stage models (R-CNN → Fast R-CNN → Faster R-CNN):

    • They first generate region proposals, then classify them. This two-step process makes them highly accurate and reliable for medical imaging, satellite data, and scientific analysis, where missing an object can be costly.

  • One-stage models (SSD, YOLO):

    • They skip proposals and predict boxes and labels in a single pass. This makes them fast enough for real-time use, which suits drones, robotics, traffic cameras, and mobile apps.
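Both families typically produce many overlapping candidate boxes for the same object and prune them with non-maximum suppression (NMS). The sketch below is a minimal, class-agnostic greedy NMS in plain Python, assuming boxes in (xmin, ymin, xmax, ymax) format; production detectors usually run an optimized, per-class version.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than `iou_threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes and one separate box (made-up values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```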

Fun fact: YOLOv1 (2015) ran in real time on a single GPU, at about 45 frames per second; this achievement helped kickstart modern real-world applications of computer vision.

The detectors from this era, roughly 2015 to 2020, remain the backbone of many enterprise systems today.

Why do two-stage detectors remain popular despite newer models?

The DETR revolution

In 2020, DEtection TRansformer (DETR) introduced a radical idea: treat detection as set prediction, not box proposal generation. Instead of searching across thousands of anchor boxes or candidate regions, DETR uses:

  1. A backbone (CNN or Vision Transformer) to extract features.

  2. A transformer encoder to model relationships across the entire image.

  3. A transformer decoder with object queries that directly predict bounding boxes and labels.

  4. A Hungarian matching step, used during training, to pair predicted objects one-to-one with ground-truth objects.

The result is a detector with no anchors, no proposal networks, and no non-max suppression. It has a simpler training objective, more robust predictions, and better object context modeling.
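The matching step can be illustrated with a small sketch. Everything here is a simplification made for illustration: it finds the optimal one-to-one assignment by brute force over permutations rather than with the Hungarian algorithm (same result, only practical for tiny inputs), and its cost uses just the class probability and an L1 box distance, leaving out the generalized IoU term of the DETR paper. The function names and the `box_weight` value are assumptions.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)


def l1_distance(a: Box, b: Box) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def match_predictions(
    pred_probs: Sequence[Dict[str, float]],
    pred_boxes: Sequence[Box],
    gt_labels: Sequence[str],
    gt_boxes: Sequence[Box],
    box_weight: float = 5.0,  # assumed weighting of the box term
) -> List[Tuple[int, int]]:
    """Return (prediction_index, ground_truth_index) pairs minimizing total cost.
    Predictions left unmatched would be trained to predict "no object"."""
    if len(pred_boxes) < len(gt_boxes):
        raise ValueError("need at least as many predictions as ground-truth objects")

    def cost(p: int, g: int) -> float:
        class_cost = -pred_probs[p].get(gt_labels[g], 0.0)
        return class_cost + box_weight * l1_distance(pred_boxes[p], gt_boxes[g])

    best: Tuple[int, ...] = ()
    best_cost = float("inf")
    # Try every way of giving each ground-truth object a distinct prediction.
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(cost(p, g) for g, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return sorted((p, g) for g, p in enumerate(best))


# Three object queries, two ground-truth objects (made-up values).
pred_probs = [{"cat": 0.1, "dog": 0.8}, {"cat": 0.7, "dog": 0.2}, {"cat": 0.05, "dog": 0.05}]
pred_boxes = [(0.7, 0.7, 0.2, 0.2), (0.3, 0.3, 0.2, 0.2), (0.5, 0.5, 0.9, 0.9)]
gt_labels = ["cat", "dog"]
gt_boxes = [(0.3, 0.3, 0.2, 0.2), (0.7, 0.7, 0.2, 0.2)]
print(match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes))
```

Because each ground-truth object is assigned to exactly one prediction, the model is penalized for emitting duplicates, which is why no suppression step is needed afterwards.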

What makes DETR different from YOLO or Faster R-CNN?

Fun fact: DETR often looks “bad” early in training, but suddenly improves after many epochs, a classic example of transformers learning global structure over time.

Direct vs. indirect transformer usage

In computer vision, transformers can operate in two fundamentally different ways depending on the task.

1. Direct transformers

In this setup, the model receives the image directly as a sequence of tokens. The image is divided into fixed-size patches (e.g., 16×16), each patch is embedded into a vector, and these vectors form the input sequence. The transformer then reasons over the entire image at once.
This approach is ideal for image-level tasks, such as classification, where the goal is simply to answer:

What is in this image?

Because the model perceives the entire image within a unified context, Vision Transformers (ViTs) excel at recognizing global patterns, such as object categories, textures, and abstract shapes.
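The patch-splitting step can be sketched in a few lines. This is a toy version that treats the image as a nested list of grayscale values; the function name is hypothetical. A real ViT goes on to project each flattened patch linearly into an embedding and adds position embeddings, which is omitted here.

```python
from typing import List


def split_into_patches(image: List[List[float]], patch_size: int) -> List[List[float]]:
    """Split a 2D image into non-overlapping square patches, each flattened
    row by row, in left-to-right, top-to-bottom order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image whose pixel values are 0..15, split into 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches), patches[0])

# Sequence length for a 224x224 image with 16x16 patches.
print((224 // 16) ** 2)
```

With 16×16 patches, a 224×224 image becomes a sequence of 196 tokens, which is the input the transformer attends over.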

How do direct Vision Transformers (ViTs) process images?

2. Indirect transformers

Indirect transformer models still rely on transformers, but they use them after extracting local visual features.
A CNN or ViT backbone first converts the image into a feature map.
Then a transformer decoder receives a set of learned “object queries” and predicts:

  • object classes

  • bounding box coordinates

Instead of scanning thousands of anchor boxes, these models treat detection as set prediction, reducing complexity and improving semantic reasoning.

This approach answers:

What is in the image, and where is it?

Indirect transformer systems are used for object detection, counting, and localization, where spatial relationships are crucial.
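As a rough sketch of how the per-query outputs turn into detections, the code below follows DETR's conventions: each query emits class logits with one extra "no object" slot at the end, plus a box in normalized (cx, cy, w, h) format. The function names, threshold, and numbers are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box: Sequence[float], image_width: int, image_height: int) -> Tuple[float, ...]:
    """Convert a normalized center-format box to pixel corner coordinates."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def decode_queries(class_logits, boxes, class_names, image_width, image_height, threshold=0.5):
    """Turn per-query outputs into (label, score, pixel_box) tuples.
    Each logits row has len(class_names) + 1 entries; the last is "no object"."""
    results = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)[:-1]  # drop the "no object" probability
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            results.append((class_names[best], probs[best],
                            cxcywh_to_xyxy(box, image_width, image_height)))
    return results


# Two queries: one confident "cat", one that predicts "no object" (made-up values).
class_logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [(0.5, 0.5, 0.5, 0.5), (0.2, 0.2, 0.1, 0.1)]
results = decode_queries(class_logits, boxes, ["cat", "dog"], 640, 480)
print(results)
```

Most queries in a real model end up in the "no object" slot, so the set of surviving detections is usually much smaller than the number of queries.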

Fun fact: DETR was so radically different from classic detectors that many researchers initially thought it wouldn’t scale. Later models like DINO and Grounding DINO proved the opposite.

Modern transformer detectors

Research in object detection did not stop with DETR. Several new and improved transformer-based detectors address limitations in training speed, generalization, and flexibility:

  • DINO improves object queries and adds contrastive denoising training, resulting in faster convergence and higher accuracy.

  • Grounding-DINO enables detection based on text prompts, such as “detect pedestrians wearing helmets.”

  • OWL-ViT supports open-vocabulary detection, allowing it to find objects defined by text descriptions, even if they were not included in the training labels.

  • YOLOS serves as a pure transformer baseline for detection, operating without a CNN backbone.

  • DETA offers more efficient training and faster convergence compared to earlier models.

Most of these models are available on Hugging Face, making them ready for both inference and fine-tuning.
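As a sketch of text-prompted detection, the code below wraps the `zero-shot-object-detection` pipeline from `transformers` with the `google/owlvit-base-patch32` checkpoint. The wrapper names `detect_with_prompts` and `best_per_label` are hypothetical helpers written for this example, and calling the first one downloads model weights, so it is shown only as a commented usage line.

```python
from typing import Any, Dict, List


def detect_with_prompts(image: str, candidate_labels: List[str]) -> List[Dict[str, Any]]:
    """Run open-vocabulary detection on an image path or URL.
    Downloads the model weights on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency

    detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
    return detector(image, candidate_labels=candidate_labels)


def best_per_label(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep only the highest-scoring detection for each text label."""
    best: Dict[str, Dict[str, Any]] = {}
    for r in results:
        if r["label"] not in best or r["score"] > best[r["label"]]["score"]:
            best[r["label"]] = r
    return best


# Usage (requires transformers, a backend such as PyTorch, and network access):
# results = detect_with_prompts("path/to/street.jpg", ["a helmet", "a bicycle"])
# print(best_per_label(results))
```

The labels are free-form text chosen at inference time, so no retraining is needed to look for a new kind of object.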

What is the main advantage of modern transformer detectors like OWL-ViT or Grounding-DINO over traditional object detection models?

Fun fact: Grounding-DINO can detect objects you describe in plain English, essentially letting you “talk to the model” to find objects in images.

Object detection with Hugging Face pipelines

Hugging Face makes running object detection very simple. With just a few lines of code, you can detect objects in an image:

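from transformers import pipeline

# Load the object detection pipeline with DETR
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Placeholder: replace with the path or URL of a real image
image = "Image_url"

# Run inference on the image
output = detector(image)

# Print one line per detected object
for detection in output:
    print(detection["label"], round(detection["score"], 3), detection["box"])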
Detect objects in an image using the DETR model from Hugging Face

The pipeline returns a list of detected objects, each containing:

  • label: The name of the detected object

  • score: The model’s confidence/probability

  • box: A bounding box dictionary with the pixel coordinates of two corners: xmin, ymin, xmax, ymax
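A short sketch of working with this output follows. It assumes the `box` dictionary uses the corner keys `xmin`, `ymin`, `xmax`, and `ymax`, as in current `transformers` releases; the sample values are made up and the helper names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

# Made-up output in the shape the pipeline returns.
sample_output = [
    {"label": "cat", "score": 0.99, "box": {"xmin": 10, "ymin": 50, "xmax": 310, "ymax": 470}},
    {"label": "cat", "score": 0.97, "box": {"xmin": 340, "ymin": 20, "xmax": 630, "ymax": 370}},
    {"label": "remote", "score": 0.42, "box": {"xmin": 40, "ymin": 70, "xmax": 170, "ymax": 120}},
]


def count_labels(output: List[Dict[str, Any]], min_score: float = 0.9) -> Dict[str, int]:
    """Count detections per label, ignoring those below `min_score`."""
    return dict(Counter(d["label"] for d in output if d["score"] >= min_score))


def box_width_height(box: Dict[str, float]) -> Tuple[float, float]:
    """Derive width and height from the two corner points."""
    return box["xmax"] - box["xmin"], box["ymax"] - box["ymin"]


print(count_labels(sample_output))
print(box_width_height(sample_output[0]["box"]))
```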

Fun fact: DEtection TRansformer (DETR) treats object detection as a direct set prediction problem, meaning it can detect all objects in one forward pass without relying on traditional anchor boxes.

What is happening here:

The pipeline automatically handles the necessary pre- and post-processing steps, such as:

  • Resizing the image

  • Normalization

  • Post-processing the bounding boxes

This allows you to focus on interpreting results rather than preparing the data.
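To give a feel for what the resizing and normalization steps involve, here is a rough sketch. The defaults are assumptions: a shortest edge of 800 pixels capped at 1333 on the longest edge, and ImageNet mean and standard deviation, which are commonly used with DETR checkpoints. The model's actual image processor may round or pad differently.

```python
from typing import Sequence, Tuple

# Assumed normalization constants (standard ImageNet statistics).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_shape(width: int, height: int,
                  shortest_edge: int = 800, longest_edge: int = 1333) -> Tuple[int, int]:
    """Scale so the shorter side reaches `shortest_edge`, unless that would
    push the longer side past `longest_edge`; aspect ratio is preserved."""
    scale = shortest_edge / min(width, height)
    if max(width, height) * scale > longest_edge:
        scale = longest_edge / max(width, height)
    return round(width * scale), round(height * scale)


def normalize_pixel(rgb: Sequence[int]) -> Tuple[float, ...]:
    """Map 0-255 RGB values to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


print(resized_shape(640, 480))     # shorter side scaled up to 800
print(resized_shape(1920, 1080))   # limited by the longest-edge cap
print(normalize_pixel((255, 255, 255)))
```

After inference, the pipeline applies the reverse mapping to the predicted boxes so that the coordinates it returns refer to the original image size.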

What does the box field in the object detection output represent?

Benchmark datasets

These datasets are widely used for evaluating and training object detection and segmentation models.

  • COCO (Common Objects in Context): COCO contains 80 object categories with real-world images and is widely used for benchmarking both object detection and segmentation tasks. Its focus on standardized evaluation makes it a go-to for comparing models.

  • Objects365: Objects365 has 365 classes and millions of annotations. It is popular for large-scale pretraining and helps models improve generalization across diverse object types.

  • Open Images: Open Images includes millions of images with hierarchical labels, bounding boxes, and segmentation masks. It is particularly useful for open-world scenarios where real-world complexity and noise are important.

Each dataset has a different personality:

  • COCO is standards-focused,

  • Objects365 is large and diverse, and

  • Open Images captures complex real-world noise.
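One practical detail when working with these datasets is the box format. COCO annotations store each box as `[x, y, width, height]` with the origin at the top-left corner, while many detection tools work with corner coordinates. The sketch below converts between the two; the annotation values are made up and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def coco_to_xyxy(bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert a COCO [x, y, width, height] box to (xmin, ymin, xmax, ymax)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def xyxy_to_coco(box: Sequence[float]) -> List[float]:
    """Convert (xmin, ymin, xmax, ymax) back to COCO [x, y, width, height]."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]


# A made-up annotation in the shape of a COCO detection record.
annotation = {"image_id": 1, "category_id": 17, "bbox": [10, 20, 30, 40], "iscrowd": 0}
corners = coco_to_xyxy(annotation["bbox"])
print(corners)
```

Mixing up the two conventions is a common source of silently wrong evaluation numbers, so it helps to convert explicitly at the boundary between dataset and model code.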

Which dataset is most suitable for large-scale pretraining to improve generalization?

Where detection meets segmentation

Bounding boxes give a rough idea of object locations, but many applications require precise, pixel-level understanding.

  • Instance segmentation: Identifies exactly which pixels belong to each individual object.

  • Semantic segmentation: Groups all pixels of the same class, regardless of the individual object.

Models like Mask R-CNN, Segment Anything, and Mask2Former bridge the gap between detection and segmentation, enabling applications such as surgical diagnostics, satellite mapping, and robotic grasping.
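The difference between the two segmentation types, and their link to bounding boxes, can be shown on a toy example. The masks and helper names below are hypothetical: two "cat" instances on a 4×6 grid, merged into one semantic mask, with a box derived from one instance.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 where the pixel belongs to the object, else 0

# Instance segmentation: one mask per individual object.
cat_a: Mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cat_b: Mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]


def semantic_mask(instance_masks: List[Mask]) -> Mask:
    """Semantic segmentation for one class: the union of its instance masks,
    with no record of which pixel belongs to which object."""
    rows, cols = len(instance_masks[0]), len(instance_masks[0][0])
    return [[int(any(m[r][c] for m in instance_masks)) for c in range(cols)]
            for r in range(rows)]


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tightest (xmin, ymin, xmax, ymax) in inclusive pixel indices around a mask."""
    ys = [r for r, row in enumerate(mask) if any(row)]
    xs = [c for row in mask for c, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs), max(ys))


cats = semantic_mask([cat_a, cat_b])
print(sum(map(sum, cats)))   # number of "cat" pixels in the semantic mask
print(mask_to_box(cat_b))    # box recovered from one instance mask
```

A box can always be recovered from an instance mask, but not the other way around, which is why masks are the richer output.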

Fun fact: Instance segmentation was popularized by the COCO dataset, which provides pixel-level masks for hundreds of thousands of object instances.

Try it yourself

Now that you’ve learned the theory behind object detection, it’s time to see it in action. In this exercise, you will run a Jupyter Notebook containing code for detecting objects in images using Hugging Face pipelines. Simply execute the cells and observe the results.

Add the token value you have already created in the Text and Token Classification lesson to the first cell of the Jupyter Notebook, and then run all cells.

Summary

Object detection identifies which objects are present in an image and where they are, using bounding boxes.

Early methods employed hand-crafted features, whereas CNNs enabled end-to-end learning with models such as Faster R-CNN and YOLO. DETR and modern transformer detectors, such as DINO and Grounding-DINO, improve accuracy, context understanding, and open-vocabulary detection.

Hugging Face pipelines make it easy to run these models and see results on real images.