Project: Bag of Visual Words
The bag of words model has been used extensively in natural language processing to represent a document as a feature vector. Consider a language with a vocabulary V of n words. We can assume some ordering of the words (e.g., lexicographic) in V, resulting in V = (w_1, w_2, ..., w_n), where w_i is the i-th word in the vocabulary V. To make a feature vector for any document, create a vector f of n components, that is, f = (f_1, f_2, ..., f_n), where the i-th component f_i of f represents the frequency of the word w_i in the document.
Note: The feature vectors of different-sized documents have the same number of components, and any reordering (permutation) of the words in a document keeps the feature vector unchanged.
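The definition and the note above can be sketched in a few lines of Python. The toy vocabulary and documents here are hypothetical, chosen only to illustrate the frequency count and the permutation invariance:

```python
from collections import Counter

# Hypothetical toy vocabulary, assumed already in lexicographic order.
vocabulary = ["cat", "dog", "sat", "the", "truck"]

def bag_of_words(document: str) -> list[int]:
    """Return the frequency of each vocabulary word in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Two documents that are permutations of each other map to the same vector.
v1 = bag_of_words("the cat sat")
v2 = bag_of_words("sat the cat")
assert v1 == v2  # reordering the words leaves the feature vector unchanged
print(v1)  # → [1, 0, 1, 1, 0]
```

Note that documents of any length produce a vector with exactly len(vocabulary) components, which is the fixed-length property the BoVW method carries over to images.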
The bag of visual words (BoVW) is a method of representing images as fixed-length feature vectors. To define the vocabulary for images, one common approach is to cluster cropped patches of a fixed size from all the training-set images using a clustering algorithm such as k-means; the resulting cluster centers serve as the visual words.
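A minimal sketch of building the visual vocabulary might look as follows. The patch size, the number of visual words k, and the plain k-means loop are all illustrative assumptions, not the project's prescribed settings:

```python
import numpy as np

def extract_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Crop non-overlapping square patches from a grayscale image and
    flatten each patch into a vector."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size].ravel())
    return np.array(patches)

def kmeans(data: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Plain k-means; the returned cluster centers act as the visual words."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each patch to its nearest center, then recompute the means.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers
```

In practice one would pool the patches from every training image before clustering, so that the visual words summarize the whole training set rather than a single image.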
To estimate frequencies and compare images, the most common approach is to use a similarity measure. This involves partitioning an image into adjacent patches of the same size as those used during training, computing the similarity of each patch to every visual word, and accumulating these similarities in the corresponding cells of the feature vector. The closest visual word receives the highest similarity value. Finally, the feature vector is normalized to reduce the impact of image size.
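The accumulation step above can be sketched as follows. The particular similarity function, 1 / (1 + distance), is an assumption made here so that the closest visual word receives the highest value; the source does not fix a specific formula:

```python
import numpy as np

def bovw_vector(patches: np.ndarray, visual_words: np.ndarray) -> np.ndarray:
    """Accumulate each patch's similarity to every visual word into the
    feature vector, then normalize to reduce the impact of image size."""
    hist = np.zeros(len(visual_words))
    for p in patches:
        dists = np.linalg.norm(visual_words - p, axis=1)
        sims = 1.0 / (1.0 + dists)  # assumed similarity: closest word scores highest
        hist += sims
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Because of the final normalization, a large image cut into many patches and a small image cut into few yield feature vectors on the same scale, so they can be compared directly.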
In this project, the BoVW method is used to represent images as fixed-length feature vectors. The goal is to classify images of cats and trucks. The project is divided into sub-tasks, including importing modules, loading datasets, visualizing images, cropping patches, creating a bag of visual words, preparing data for the classifier, and building a classifier.
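Once every image is represented by a BoVW feature vector, the final sub-task reduces to standard supervised classification. As one hedged illustration (the project does not mandate a particular classifier here), a nearest-centroid baseline over the feature vectors could look like this; the labels 0 = cat, 1 = truck are an assumed encoding:

```python
import numpy as np

def train_nearest_centroid(X: np.ndarray, y: np.ndarray):
    """Compute the mean BoVW vector of each class; a minimal baseline."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X: np.ndarray, classes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Label each image with the class whose centroid is nearest."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```

In the project itself, X would hold one BoVW vector per training image and y the cat/truck labels; any off-the-shelf classifier could replace this baseline.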