The Bag of Visual Words (BoVW) model is a computer vision adaptation of the Bag of Words (BoW) model used in Natural Language Processing (NLP). The goal is to represent images as a simple, fixed-length frequency vector (a histogram) for classification.
In text analysis, the BoW model represents a document based only on the frequency of words it contains, ignoring grammar and word order.
Note: The feature vectors of different-sized documents have the same number of components, and any reordering (permutation) of the words in a document keeps the feature vector unchanged.
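The two properties in the note can be demonstrated with a minimal sketch (the four-word vocabulary here is hypothetical, chosen only for illustration):

```python
# Minimal Bag of Words sketch: documents of different lengths map to
# vectors of the same length, and shuffling the words leaves the
# vector unchanged.
from collections import Counter

vocabulary = ["cat", "truck", "wheel", "ear"]  # hypothetical fixed vocabulary

def bow_vector(document: str) -> list[int]:
    """Count how often each vocabulary word occurs, ignoring order."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

doc = "cat ear cat wheel"
shuffled = "wheel cat ear cat"
# Same histogram regardless of word order: [2, 0, 1, 1]
assert bow_vector(doc) == bow_vector(shuffled) == [2, 0, 1, 1]
```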
To adapt this to images, we must define a visual vocabulary where “visual words” replace text words. This step uses K-means clustering.
Mechanism using K-means
Extract patches: Small, fixed-size image patches are cropped from all images in the training set. These patches are treated as the input data points for clustering. For example, we crop thousands of pixel patches from all cat and truck training images.
Cluster patches: K-means clustering is applied to these patches, with a chosen number of clusters K. For example, we run K-means with K = 10 on all the extracted feature vectors.
Visual words (centroids): The centroids of the resulting clusters form the visual vocabulary (or dictionary). Each centroid represents a common visual feature. For example, the 10 cluster centroids (c1 to c10) are saved; one centroid might represent a truck wheel pattern, and another a cat ear pattern.
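The three steps above can be sketched as follows. This is an illustrative implementation under two assumptions: patches are already flattened into feature vectors, and a tiny hand-rolled K-means stands in for a library implementation (e.g., scikit-learn's KMeans):

```python
import numpy as np

def build_vocabulary(patches: np.ndarray, k: int, iters: int = 20,
                     seed: int = 0) -> np.ndarray:
    """Cluster patch vectors and return the k centroids (visual words)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen patches.
    centroids = patches[rng.choice(len(patches), size=k, replace=False)]
    for _ in range(iters):
        # Assign each patch to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(patches[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned patches.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = patches[labels == j].mean(axis=0)
    return centroids

# Usage: 200 fake 8x8 patches flattened to 64-dim vectors, K = 10 visual words.
patches = np.random.default_rng(1).random((200, 64))
vocab = build_vocabulary(patches, k=10)
print(vocab.shape)  # (10, 64): one 64-dim centroid per visual word
```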
Once the visual vocabulary is created, any new image to be classified is converted into a fixed-length feature vector (a histogram of visual word frequencies).
Partition image: The new image is partitioned into adjacent patches of the same size as those used during vocabulary training. For example, we divide the new cat image into dozens of patches and extract a feature descriptor from each one.
Quantization (voting): For each patch, the closest visual word (centroid) is found using a similarity measure (distance). For a patch of cat fur, we calculate its distance to all 10 visual words (c1 to c10), find the closest one, say c2, and the patch votes for that visual word.
Accumulation: Each vote increments the count (frequency) at the index of that visual word in the feature vector. For example, if 39 patches voted for c10 after all patches are processed, the final feature vector will have a count of 39 at the 10th position.
Feature vector: This process results in a fixed-length feature vector where each component records the frequency of a specific visual word found in the image.
Normalization: The feature vector is typically normalized (e.g., divided by the total number of patches) to reduce the impact of the original image size on the final frequency counts.
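The encoding steps above (quantization, accumulation, normalization) can be sketched in a few lines. This assumes `vocab` holds the K centroids and each patch is already a flattened feature vector:

```python
import numpy as np

def bovw_histogram(patches: np.ndarray, vocab: np.ndarray) -> np.ndarray:
    """Quantize each patch to its nearest visual word; return normalized counts."""
    # Distance from every patch to every centroid.
    dists = np.linalg.norm(patches[:, None, :] - vocab[None, :, :], axis=2)
    votes = dists.argmin(axis=1)  # index of the closest visual word
    counts = np.bincount(votes, minlength=len(vocab)).astype(float)
    return counts / len(patches)  # normalize by the number of patches

# Usage with a hypothetical 10-word vocabulary and 50 patches from one image.
rng = np.random.default_rng(2)
vocab = rng.random((10, 64))
patches = rng.random((50, 64))
hist = bovw_histogram(patches, vocab)
print(hist.shape)  # (10,): one frequency per visual word, summing to ~1
```

Normalizing by the patch count means a large image and a small image of the same scene produce comparable vectors.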
In this project, the BoVW method is used to represent images as fixed-length feature vectors. The goal is to classify images of cats and trucks.
The project is divided into the following sub-tasks:
Importing necessary modules.
Loading and visualizing datasets.
Cropping patches from the images.
Creating a Bag of Visual Words (using clustering).
Preparing data for the classifier.
Building a classifier to distinguish between cats and trucks.
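For the final sub-task, once every image is a fixed-length BoVW histogram, any standard classifier can be applied. As a minimal sketch, a 1-nearest-neighbor rule on hypothetical 4-word histograms (standing in for whatever classifier the project ultimately uses):

```python
import numpy as np

def predict(query: np.ndarray, train_hists: np.ndarray,
            labels: list[str]) -> str:
    """Label a query histogram with the label of its nearest training histogram."""
    dists = np.linalg.norm(train_hists - query, axis=1)
    return labels[int(dists.argmin())]

# Hypothetical training data: trucks weight visual word 0, cats word 3.
train = np.array([[0.7, 0.1, 0.1, 0.1],   # truck
                  [0.1, 0.1, 0.1, 0.7]])  # cat
labels = ["truck", "cat"]

# A query histogram dominated by word 0 is labeled "truck".
print(predict(np.array([0.6, 0.2, 0.1, 0.1]), train, labels))  # truck
```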