Object Detection with Vision Transformers
This project implements object detection with a Vision Transformer (ViT). It uses the Caltech 101 dataset, restricted to the airplanes category. The notebook starts with the necessary imports and setup, then downloads and extracts the dataset. It lists and sorts the paths to the images and annotations, then resizes and preprocesses the images to build the training and test sets. Next, it defines a multilayer perceptron (MLP) function and layers for patch creation and patch encoding, and visualizes the patches generated from input images. A ViT model is then implemented, trained, and evaluated on the Caltech 101 data; the training history is plotted and the model is saved. Finally, the trained model is evaluated by computing Intersection over Union (IoU) on a subset of test images, visualizing both the predicted and ground-truth bounding boxes. Together, these steps form an end-to-end ViT object detection pipeline, with explanations and visualizations along the way.
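The IoU metric used in the final evaluation step can be illustrated with a minimal sketch. The function name `iou` and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for illustration; the notebook may store its boxes in a different order or scale, so this should be read as one possible formulation rather than the project's own code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Assumed (hypothetical) box layout: (x_min, y_min, x_max, y_max).
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so that disjoint boxes give an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)     # illustrative values, not project data
    ground_truth = (1, 1, 3, 3)  # illustrative values, not project data
    print(f"IoU: {iou(predicted, ground_truth):.4f}")
```

An IoU of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means the boxes do not overlap at all. Averaging this value over a set of test images gives a single summary score for the quality of the predicted boxes.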