The Skip-Gram Algorithm
Explore the skip-gram algorithm to learn how it generates word embeddings by leveraging surrounding word contexts in text data. This lesson guides you through preparing training data, the algorithm's mechanics, and implementing it in TensorFlow, helping you build foundational skills for NLP tasks.
The first algorithm we’ll talk about is known as the skip-gram algorithm, which is a type of Word2vec algorithm. As we have discussed in numerous places, the meaning of a word can be elicited from the contextual words surrounding it. However, it isn’t entirely straightforward to develop a model that exploits this way of learning word meanings. The skip-gram algorithm, introduced by Mikolov et al. in 2013, exploits the context of words in written text to learn good word embeddings.
Let’s go through the skip-gram algorithm step by step. First, we’ll discuss the data preparation process. Understanding the format of the data puts us in a great position to understand the algorithm. We’ll then discuss the algorithm itself. Finally, we’ll implement the algorithm using TensorFlow.
From raw text to semistructured text
First, we need to design a mechanism to extract a dataset that can be fed to our learning model. Such a dataset should be a set of tuples of the format (target, context). Moreover, this needs to be created in an unsupervised manner. That is, a human should not have to manually engineer the labels for the data. In summary, the data preparation process should do the following:
- Capture the surrounding words of a given word (that is, the context).
- Run in an unsupervised manner.
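As a concrete illustration of these two requirements, here is a minimal Python sketch (the function and variable names are our own, not from the lesson) that extracts (target, context) pairs from raw text with no manual labeling — the "labels" come from word positions alone:

```python
def generate_skipgram_pairs(text, window_size=2):
    """Yield (target, context) word tuples from a whitespace-tokenized text."""
    words = text.lower().split()
    pairs = []
    for i, target in enumerate(words):
        # Context = up to `window_size` words on each side of the target.
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

pairs = generate_skipgram_pairs("the dog barked at the mailman", window_size=1)
```

Note that no human annotation is involved: every pair is derived purely from which words co-occur within the window, which is what makes the process unsupervised.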
The skip-gram model uses the following approach to design a dataset:
- For a given word w_i, a context window size of m is assumed. By “context window size,” we mean the number of words considered as context on a single side of the target. Therefore, for a window size of m, the context window (including the target word w_i) will be of size 2m + 1 and will look like this: [w_{i-m}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{i+m}].
- Next, (target, context) tuples are formed as ...
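The window arithmetic described above can be sketched in plain Python (the helper names are illustrative, not part of the lesson's TensorFlow implementation). `context_window` returns the 2m + 1 words centered on index i, and `tuples_for_target` pairs the target with each of its context words:

```python
def context_window(words, i, m):
    """Return the window [w_{i-m}, ..., w_i, ..., w_{i+m}] around index i.

    Away from sentence boundaries, this window has exactly 2m + 1 words.
    """
    return words[max(0, i - m): i + m + 1]

def tuples_for_target(words, i, m):
    """Form (target, context) tuples pairing w_i with each context word."""
    target = words[i]
    return [(target, words[j])
            for j in range(max(0, i - m), min(len(words), i + m + 1))
            if j != i]
```

For example, with m = 1 and the sentence "the dog barked at the mailman", the target "barked" yields the tuples ("barked", "dog") and ("barked", "at").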