
Project: Bag of Visual Words

The Bag of Visual Words (BoVW) model is a computer vision adaptation of the Bag of Words (BoW) model used in Natural Language Processing (NLP). The goal is to represent images as a simple, fixed-length frequency vector (a histogram) for classification.

In text analysis, the BoW model represents a document based only on the frequency of words it contains, ignoring grammar and word order.

  • Vocabulary: A predetermined set of $m$ unique words, $\{v_1, v_2, \dots, v_m\}$.
  • Feature vector ($\mathbf{x}$): Any document is converted into a vector $\mathbf{x} \in \mathbb{R}^m$, where the $k^{\text{th}}$ component records the frequency of the word $v_k$ in the document.

Note: The feature vectors of different-sized documents have the same number of components, and any reordering (permutation) of the words in a document keeps the feature vector unchanged.
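As a toy illustration of these two properties, the sketch below builds frequency vectors over a hypothetical four-word vocabulary; the helper `bow_vector` and the example sentences are illustrative, not part of the project.

```python
# Minimal sketch of the text BoW model with a toy vocabulary (illustrative only).
vocabulary = ["cat", "truck", "wheel", "fur"]  # m = 4 unique words

def bow_vector(document, vocabulary):
    """Map a document to a length-m word-frequency vector x."""
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

doc = "the cat chased the truck and the cat slept"
print(bow_vector(doc, vocabulary))       # [2, 1, 0, 0]

# Reordering (permuting) the words leaves the vector unchanged:
shuffled = "cat the slept truck the chased cat the and"
print(bow_vector(shuffled, vocabulary))  # [2, 1, 0, 0]
```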

Defining the visual vocabulary

To adapt this to images, we must define a visual vocabulary where “visual words” replace text words. This step uses K-means clustering.

Mechanism using K-means

  1. Extract patches: Small, fixed-size image patches are cropped from all images in the training set and treated as the input data points for clustering. For example, we crop thousands of $16 \times 16$ pixel patches from all cat and truck training images (see the code sketch after this list).

  2. Cluster patches: K-means clustering is applied to these patches with a chosen number of clusters, $K$. For example, we run K-means with $K = 10$ on all the extracted feature vectors.

  3. Visual words (centroids): The centroids of the $K$ resulting clusters form the visual vocabulary (or dictionary). Each centroid represents a common visual feature. For example, the 10 cluster centroids ($\mu_1, \dots, \mu_{10}$) are saved. $\mu_1$ might represent a truck wheel pattern, and $\mu_{10}$ might represent a cat ear pattern.
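Here is a minimal sketch of this vocabulary-building pipeline, assuming grayscale images stored as NumPy arrays and scikit-learn's `KMeans`; the helper names `extract_patches` and `build_vocabulary` are hypothetical, not from the project code.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(image, patch_size=16):
    """Crop adjacent, non-overlapping patch_size x patch_size patches
    from a grayscale image and flatten each into a feature vector."""
    h, w = image.shape[:2]
    patches = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size].ravel())
    return np.array(patches)

def build_vocabulary(training_images, K=10, patch_size=16):
    """Cluster all training patches; the K centroids are the visual words."""
    all_patches = np.vstack([extract_patches(img, patch_size)
                             for img in training_images])
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_patches)
    return kmeans.cluster_centers_  # shape: (K, patch_size * patch_size)
```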


Feature extraction (Image-to-vector)

Once the visual vocabulary is created, any new image to be classified can be converted into a fixed-length feature vector (a histogram of visual word frequencies).

  1. Partition image: The new image is partitioned into adjacent patches of the same size as those used during vocabulary training. For example, we divide the new cat image into dozens of $16 \times 16$ patches and extract a feature descriptor from each one.

  2. Quantization (voting): For each patch, the closest visual word (centroid) is found using a similarity measure (distance). For a patch of cat fur, we calculate its distance to all 10 visual words ($\mu_1$ to $\mu_{10}$) and find that it is closest to $\mu_{10}$. This patch therefore votes for visual word $\mu_{10}$.

  3. Accumulation: The count (frequency) at the winning visual word's index in the feature vector is incremented. For example, if 39 patches voted for $\mu_{10}$ after processing all patches, the final feature vector has a count of 39 at the 10th position.

  4. Feature vector: This process results in a fixed-length feature vector where each component records the frequency of a specific visual word found in the image.

  5. Normalization: The feature vector is typically normalized (e.g., divided by the total number of patches) to reduce the impact of the original image size on the final frequency counts. The sketch after this list combines steps 1–5.

Feature extraction in BoVW
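Under the same assumptions as the earlier sketch (grayscale NumPy images and the hypothetical `extract_patches` helper), the image-to-vector steps might look like this:

```python
import numpy as np

def bovw_histogram(image, centroids, patch_size=16):
    """Quantize each patch to its nearest visual word and
    return the normalized frequency vector for the image."""
    patches = extract_patches(image, patch_size)  # step 1: partition image
    # Step 2: squared Euclidean distance from every patch to every centroid.
    dists = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    votes = dists.argmin(axis=1)                  # nearest centroid per patch
    # Steps 3-4: accumulate votes into a length-K frequency vector.
    hist = np.bincount(votes, minlength=len(centroids)).astype(float)
    return hist / hist.sum()                      # step 5: divide by patch count
```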

Project goal

In this project, the BoVW method is used to represent images as fixed-length feature vectors. The goal is to classify images of cats and trucks.

The project is divided into the following sub-tasks:

  • Importing necessary modules (a possible set is sketched after this list).

  • Loading and visualizing datasets.

  • Cropping patches from the images.

  • Creating a Bag of Visual Words (using clustering).

  • Preparing data for the classifier.

  • Building a classifier to distinguish between cats and trucks.
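As a rough indication of what the first sub-task might involve, here is one plausible set of imports for these steps; the actual project may use different libraries, so treat this as an assumption rather than the project's confirmed setup.

```python
import numpy as np                   # array manipulation for patches and histograms
import matplotlib.pyplot as plt     # loading/visualizing the cat and truck images
from sklearn.cluster import KMeans  # clustering patches into the visual vocabulary
from sklearn.svm import SVC         # one candidate cat-vs-truck classifier
```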