How to perform word embedding using Word2Vec
In natural language processing, word embeddings are a type of word representation that maps words into continuous vector spaces where semantically similar words are located closer together. This transformation into numerical vectors facilitates the processing of natural language by machine learning models and enhances the performance of various NLP tasks.
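To make the idea of "closer together" concrete, the short sketch below computes the cosine similarity between two made-up three-dimensional word vectors. The vector values are invented purely for illustration; real embedding models learn much higher-dimensional vectors from data.

import numpy as np

# Made-up 3-dimensional vectors for two words (for illustration only)
vec_king = np.array([0.8, 0.3, 0.1])
vec_queen = np.array([0.7, 0.4, 0.2])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(vec_king, vec_queen) / (np.linalg.norm(vec_king) * np.linalg.norm(vec_queen))
print(similarity)  # values close to 1 indicate similar directions, i.e., semantically similar words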
Understanding Word2Vec
Word2Vec is a popular word embedding technique developed by Google. It consists of two main model architectures: continuous bag of words (CBOW) and skip-gram. Both use shallow neural networks to learn word representations by predicting either a target word from its context (CBOW) or context words from a target word (skip-gram).
Let’s understand the theoretical foundations of CBOW and skip-gram:
Continuous bag of words (CBOW)
The CBOW model predicts the target word (the center word) from the nearby words (the surrounding context) within a preselected window size. Because it infers a word from its context, CBOW suits tasks where a word's meaning is largely determined by the words that appear around it.
The following diagram shows a CBOW (continuous bag of words) architecture, which predicts a central word based on the surrounding words in a sentence.
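As a concrete illustration (not part of the original example), the following sketch lists the (context, target) training pairs that a CBOW model would derive from one short sentence with a window size of 2; both the sentence and the window size are chosen only for demonstration.

# Illustrative only: enumerate (context, target) pairs as CBOW sees them
sentence = ["the", "whale", "swam", "past", "the", "ship"]
window = 2  # words considered on each side of the target

for i, target in enumerate(sentence):
    # Collect the surrounding words within the window, excluding the target itself
    context = [sentence[j] for j in range(max(0, i - window), min(len(sentence), i + window + 1)) if j != i]
    print(f"context {context} -> target '{target}'")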
Skip-gram
On the other hand, skip-gram operates in a different manner. It predicts the context words based on the central word. This means that given a central word, skip-gram anticipates what words are likely to surround it. Because of this approach, skip-gram is particularly adept at capturing fine-grained semantic relationships and at learning useful representations for rare or infrequent words.
The following diagram shows a skip-gram architecture, which predicts surrounding words based on a central target word in a sentence.
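For comparison, this sketch (again just an illustration, using the same made-up sentence and a window size of 2) lists the (target, context) pairs that skip-gram would generate, one pair per context word.

# Illustrative only: enumerate (target, context) pairs as skip-gram sees them
sentence = ["the", "whale", "swam", "past", "the", "ship"]
window = 2  # words considered on each side of the target

for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            print(f"target '{target}' -> context '{sentence[j]}'")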
Code example
The following code demonstrates the process of creating word embedding models using gensim to analyze the semantic relationships between words in a novel (moby.txt). It reads the text from a file, cleans and tokenizes it into sentences and words, and converts the words to lowercase. Two word embedding models are created: continuous bag of words (CBOW) and skip-gram, both using a vector size of 200 and a context window of 7 words. The code then prints the cosine similarities between the word pairs "whale" and "ship" and "whale" and "sea" for each model, highlighting the different ways these models capture word relationships.
# Import all necessary modules
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize  # requires the NLTK 'punkt' tokenizer data (nltk.download('punkt'))
import warnings

warnings.filterwarnings(action='ignore')

txt_file_path = "moby.txt"
# Reads the text file
with open(txt_file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Replaces escape characters with space
cleaned_text = text.replace("\n", " ")

data = []

# Iterate through each sentence in the text
for sentence in sent_tokenize(cleaned_text):
    temp = []

    # Tokenize the sentence into words
    for word in word_tokenize(sentence):
        temp.append(word.lower())

    data.append(temp)

# Create the CBOW model
cbow_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7)

print("Continuous Bag of Words (CBOW)")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", cbow_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", cbow_model.wv.similarity('whale', 'sea'))

# Create the skip-gram model (sg=1 selects the skip-gram training algorithm)
skipGram_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7, sg=1)

print("\nSkip Gram")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", skipGram_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", skipGram_model.wv.similarity('whale', 'sea'))
Code explanation
Here’s the explanation of the above code implementation:
Lines 2–5: These lines import the required modules for the Word2Vec implementation:
gensim: This is a library for topic modeling, document indexing, and similarity retrieval with large corpora.
Word2Vec from gensim.models: This is the Word2Vec model class for training and working with word embeddings.
sent_tokenize and word_tokenize from nltk.tokenize: These are functions for tokenizing text into sentences and words, respectively.
warnings: This is a Python standard library module for handling warnings.
Line 7: This line suppresses warnings that might occur during the execution of the code. It’s often used to ignore unnecessary warning messages.
Lines 9–12: Here, the code specifies the path to the text file (moby.txt) containing the corpus for training the Word2Vec model. It then reads the contents of the file into the variable text.
Line 15: This line removes escape characters (like the newline \n) from the text and replaces them with spaces. It ensures that the text is clean and ready for tokenization.
Lines 17–27: These lines tokenize the cleaned text into sentences using sent_tokenize, and then tokenize each sentence into words using word_tokenize. Each word is also converted to lowercase to ensure consistency.
Line 30: This line creates a continuous bag of words (CBOW) model using the Word2Vec class from gensim. It specifies parameters such as min_count (the minimum frequency a word needs in order to be kept), vector_size (the dimensionality of the word vectors), and window (the size of the context window).
Lines 32–36: These lines print the results of the CBOW model, specifically the cosine similarity between the selected word pairs (whale and ship, and whale and sea) using the similarity method.
Line 39: Similar to CBOW, this line creates a skip-gram model by setting sg=1 in the parameters of the Word2Vec class (sg=1 tells gensim to use the skip-gram training algorithm instead of the default CBOW).
Lines 41–45: These lines print the results of the skip-gram model, showing the cosine similarity between the same selected word pairs as in the CBOW model. A short sketch after this list shows a few additional queries you can run on the trained models.
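Once training has finished, gensim's KeyedVectors interface offers a few more ways to inspect what the models learned. The sketch below assumes the cbow_model trained in the code above; the file name cbow_moby.model is just a placeholder.

# Assumes cbow_model has been trained by the code above

# Look up the 200-dimensional vector learned for a word
whale_vector = cbow_model.wv['whale']
print(whale_vector.shape)  # (200,)

# Find the five words whose vectors are closest to 'whale'
print(cbow_model.wv.most_similar('whale', topn=5))

# Save the trained model to disk and reload it later (placeholder file name)
cbow_model.save("cbow_moby.model")
reloaded_model = Word2Vec.load("cbow_moby.model")
print(reloaded_model.wv.similarity('whale', 'sea'))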
Conclusion
We took a close look at Word2Vec and its usage with the gensim library, exploring how it works. By training Word2Vec models on textual data, we can measure semantic similarities between words and support downstream NLP tasks such as sentiment analysis and text classification. As shown, Word2Vec provides a practical path to richer natural language processing, allowing us to uncover the semantic relationships hidden in unstructured text data.