
Project: Bag of Visual Words

The Bag of Visual Words (BoVW) model is a computer vision adaptation of the Bag of Words (BoW) model used in Natural Language Processing (NLP). The goal is to represent images as a simple, fixed-length frequency vector (a histogram) for classification.

In text analysis, the BoW model represents a document based only on the frequency of words it contains, ignoring grammar and word order.

  • Vocabulary: A predetermined set of $m$ unique words, $\{v_1, v_2, \dots, v_m\}$.
  • Feature vector ($\mathbf{x}$): Any document is converted into a vector $\mathbf{x} \in \mathbb{R}^m$, where the $k^{\text{th}}$ component records the frequency of the word $v_k$ in the document.

Note: The feature vectors of different-sized documents have the same number of components, and any reordering (permutation) of the words in a document keeps the feature vector unchanged.
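As a toy illustration of these two properties, the sketch below builds frequency vectors over a hypothetical four-word vocabulary; the helper `bow_vector` and the example sentences are illustrative, not part of the project.

```python
# Minimal sketch of the text BoW model with a toy vocabulary (illustrative only).
vocabulary = ["cat", "truck", "wheel", "fur"]  # m = 4 unique words

def bow_vector(document, vocabulary):
    """Map a document to a length-m word-frequency vector x."""
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

doc = "the cat chased the truck and the cat slept"
print(bow_vector(doc, vocabulary))       # [2, 1, 0, 0]

# Reordering (permuting) the words leaves the vector unchanged:
shuffled = "cat the slept truck the chased cat the and"
print(bow_vector(shuffled, vocabulary))  # [2, 1, 0, 0]
```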

Defining the visual vocabulary

To adapt this to images, we must define a visual vocabulary where “visual words” replace text words. This step uses K-means clustering.

Mechanism using K-means

  1. Extract patches: Small, fixed-size image patches are cropped from all images in the training set and treated as the input data points for clustering. For example, we crop thousands of $16 \times 16$ pixel patches from all cat and truck training images (see the code sketch after this list).

  2. Cluster patches: K-means clustering is applied to these patches with a chosen number of clusters, $K$. For example, we run K-means with $K = 10$ on all the extracted feature vectors.

  3. Visual words (centroids): The centroids of the $K$ resulting clusters form the visual vocabulary (or dictionary). Each centroid represents a common visual feature. For example, the 10 cluster centroids ($\mu_1, \dots, \mu_{10}$) are saved. $\mu_1$ might represent a truck wheel pattern, and $\mu_{10}$ might represent a cat ear pattern.
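Here is a minimal sketch of this vocabulary-building pipeline, assuming grayscale images stored as NumPy arrays and scikit-learn's `KMeans`; the helper names `extract_patches` and `build_vocabulary` are hypothetical, not from the project code.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(image, patch_size=16):
    """Crop adjacent, non-overlapping patch_size x patch_size patches
    from a grayscale image and flatten each into a feature vector."""
    h, w = image.shape[:2]
    patches = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].ravel())
    return np.array(patches)

def build_vocabulary(training_images, K=10, patch_size=16):
    """Cluster all training patches; the K centroids are the visual words."""
    all_patches = np.vstack([extract_patches(img, patch_size)
                             for img in training_images])
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_patches)
    return kmeans.cluster_centers_  # shape: (K, patch_size * patch_size)
```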


Feature extraction (Image-to-vector)

Once the visual vocabulary is created, any new image to be classified can be converted into a fixed-length feature vector (a histogram of visual word frequencies).

  1. Partition image: The new image is partitioned into adjacent patches of the same size as those used during vocabulary training. For example, we divide the new cat image into dozens of $16 \times 16$ patches and extract a feature descriptor from each one.

  2. Quantization (voting): For each patch, the closest visual word (centroid) is found using a similarity measure (distance). For a patch of cat fur, we calculate its distance to all 10 visual words ($\mu_1$ to $\mu_{10}$) and find that it is closest to $\mu_{10}$. This patch therefore votes for visual word $\mu_{10}$.

  3. Accumulation: The count (frequency) at the winning visual word's index in the feature vector is incremented. For example, if 39 patches voted for $\mu_{10}$ after processing all patches, the final feature vector has a count of 39 at the 10th position.

  4. Feature vector: This process results in a fixed-length feature vector where each component records the frequency of a specific visual word found in the image.

  5. Normalization: The feature vector is typically normalized (e.g., divided by the total number of patches) to reduce the impact of the original image size on the final frequency counts. The sketch after this list combines steps 1–5.

Feature extraction in BoVW
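Under the same assumptions as the earlier sketch (grayscale NumPy images and the hypothetical `extract_patches` helper), the image-to-vector steps might look like this:

```python
import numpy as np

def bovw_histogram(image, centroids, patch_size=16):
    """Quantize each patch to its nearest visual word and
    return the normalized frequency vector for the image."""
    patches = extract_patches(image, patch_size)  # step 1: partition image
    # Step 2: squared Euclidean distance from every patch to every centroid.
    dists = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    votes = dists.argmin(axis=1)                  # nearest centroid per patch
    # Steps 3-4: accumulate votes into a length-K frequency vector.
    hist = np.bincount(votes, minlength=len(centroids)).astype(float)
    return hist / hist.sum()                      # step 5: divide by patch count
```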

Project goal

In this project, the BoVW method is used to represent images as fixed-length feature vectors. The goal is to classify images of cats and trucks.

The project is divided into the following sub-tasks:

  • Importing necessary modules (a possible set is sketched after this list).

  • Loading and visualizing datasets.

  • Cropping patches from the images.

  • Creating a Bag of Visual Words (using clustering).

  • Preparing data for the classifier.

  • Building a classifier to distinguish between cats and trucks.
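As a rough indication of what the first sub-task might involve, here is one plausible set of imports for these steps; the actual project may use different libraries, so treat this as an assumption rather than the project's confirmed setup.

```python
import numpy as np                   # array manipulation for patches and histograms
import matplotlib.pyplot as plt     # loading/visualizing the cat and truck images
from sklearn.cluster import KMeans  # clustering patches into the visual vocabulary
from sklearn.svm import SVC         # one candidate cat-vs-truck classifier
```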