How to perform word embedding using Word2Vec
In natural language processing, word embeddings are a type of word representation that maps words into continuous vector spaces where semantically similar words are located closer together. This transformation into numerical vectors facilitates the processing of natural language by machine learning models and enhances the performance of various NLP tasks.
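To make the idea of "closer together" concrete, the short sketch below computes the cosine similarity between two made-up three-dimensional word vectors. The vector values are invented purely for illustration; real embedding models learn much higher-dimensional vectors from data.

import numpy as np

# Made-up 3-dimensional vectors for two words (for illustration only)
vec_king = np.array([0.8, 0.3, 0.1])
vec_queen = np.array([0.7, 0.4, 0.2])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(vec_king, vec_queen) / (np.linalg.norm(vec_king) * np.linalg.norm(vec_queen))
print(similarity)  # values close to 1 indicate similar directions, i.e., semantically similar words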
Understanding Word2Vec
Word2Vec is a popular word embedding technique developed by Google. It consists of two main model architectures: continuous bag of words (CBOW) and skip-gram. Both use shallow neural networks to learn word representations by predicting either a target word from its context (CBOW) or context words from a target word (skip-gram).
Let’s understand the theoretical foundations of CBOW and skip-gram:
Continuous bag of words (CBOW)
The CBOW model predicts the target word (the center word) from the nearby words (the surrounding context) within a preselected window size. Because it infers a word from its context, CBOW suits tasks where a word's meaning is largely determined by the words that appear around it.
The following diagram shows a CBOW (continuous bag of words) architecture, which predicts a central word based on the surrounding words in a sentence.
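As a concrete illustration (not part of the original example), the following sketch lists the (context, target) training pairs that a CBOW model would derive from one short sentence with a window size of 2; both the sentence and the window size are chosen only for demonstration.

# Illustrative only: enumerate (context, target) pairs as CBOW sees them
sentence = ["the", "whale", "swam", "past", "the", "ship"]
window = 2  # words considered on each side of the target

for i, target in enumerate(sentence):
    # Collect the surrounding words within the window, excluding the target itself
    context = [sentence[j] for j in range(max(0, i - window), min(len(sentence), i + window + 1)) if j != i]
    print(f"context {context} -> target '{target}'")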
Skip-gram
On the other hand, skip-gram operates in a different manner. It predicts the context words based on the central word. This means that given a central word, skip-gram anticipates what words are likely to surround it. Because of this approach, skip-gram is particularly adept at capturing fine-grained semantic relationships and at learning useful representations for rare or infrequent words.
The following diagram shows a skip-gram architecture, which predicts surrounding words based on a central target word in a sentence.
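For comparison, this sketch (again just an illustration, using the same made-up sentence and a window size of 2) lists the (target, context) pairs that skip-gram would generate, one pair per context word.

# Illustrative only: enumerate (target, context) pairs as skip-gram sees them
sentence = ["the", "whale", "swam", "past", "the", "ship"]
window = 2  # words considered on each side of the target

for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            print(f"target '{target}' -> context '{sentence[j]}'")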
Code example
The following code demonstrates the process of creating word embedding models using gensim to analyze the semantic relationships between words in a novel (moby.txt). It reads the text from a file, cleans and tokenizes it into sentences and words, and converts the words to lowercase. Two word embedding models are created: continuous bag of words (CBOW) and skip-gram, both using a vector size of 200 and a context window of 7 words. The code then prints the cosine similarities between the word pairs "whale" and "ship" and "whale" and "sea" for each model, highlighting the different ways these models capture word relationships.
# Import all necessary modules
import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize  # requires the NLTK 'punkt' tokenizer data (nltk.download('punkt'))
import warnings

warnings.filterwarnings(action='ignore')

txt_file_path = "moby.txt"
# Reads the text file
with open(txt_file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Replaces escape characters with space
cleaned_text = text.replace("\n", " ")

data = []

# Iterate through each sentence in the text
for sentence in sent_tokenize(cleaned_text):
    temp = []

    # Tokenize the sentence into words
    for word in word_tokenize(sentence):
        temp.append(word.lower())

    data.append(temp)

# Create the CBOW model
cbow_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7)

print("Continuous Bag of Words (CBOW)")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", cbow_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", cbow_model.wv.similarity('whale', 'sea'))

# Create the skip-gram model (sg=1 selects the skip-gram training algorithm)
skipGram_model = gensim.models.Word2Vec(data, min_count=1, vector_size=200, window=7, sg=1)

print("\nSkip Gram")

# Print results
print("Cosine similarity between 'whale' and 'ship' : ", skipGram_model.wv.similarity('whale', 'ship'))
print("Cosine similarity between 'whale' and 'sea' : ", skipGram_model.wv.similarity('whale', 'sea'))
Code explanation
Here’s the explanation of the above code implementation:
Lines 2–5: These lines import the required modules for the Word2Vec implementation:
gensim: This is a library for topic modeling, document indexing, and similarity retrieval with large corpora.
Word2Vec from gensim.models: This is the Word2Vec model class for training and working with word embeddings.
sent_tokenize and word_tokenize from nltk.tokenize: These are functions for tokenizing text into sentences and words, respectively.
warnings: This is a Python standard library module for handling warnings.
Line 7: This line suppresses warnings that might occur during the execution of the code. It’s often used to ignore unnecessary warning messages.
Lines 9–12: Here, the code specifies the path to the text file (moby.txt) containing the corpus for training the Word2Vec model. It then reads the contents of the file into the variable text.
Line 15: This line removes escape characters (like the newline \n) from the text and replaces them with spaces. It ensures that the text is clean and ready for tokenization.
Lines 17–27: These lines tokenize the cleaned text into sentences using sent_tokenize, and then tokenize each sentence into words using word_tokenize. Each word is also converted to lowercase to ensure consistency.
Line 30: This line creates a continuous bag of words (CBOW) model using the Word2Vec class from gensim. It specifies parameters such as min_count (the minimum frequency a word needs in order to be kept), vector_size (the dimensionality of the word vectors), and window (the size of the context window).
Lines 32–36: These lines print the results of the CBOW model, specifically the cosine similarity between the selected word pairs (whale and ship, and whale and sea) using the similarity method.
Line 39: Similar to CBOW, this line creates a skip-gram model by setting sg=1 in the parameters of the Word2Vec class (sg=1 tells gensim to use the skip-gram training algorithm instead of the default CBOW).
Lines 41–45: These lines print the results of the skip-gram model, showing the cosine similarity between the same selected word pairs as in the CBOW model. A short sketch after this list shows a few additional queries you can run on the trained models.
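Once training has finished, gensim's KeyedVectors interface offers a few more ways to inspect what the models learned. The sketch below assumes the cbow_model trained in the code above; the file name cbow_moby.model is just a placeholder.

# Assumes cbow_model has been trained by the code above

# Look up the 200-dimensional vector learned for a word
whale_vector = cbow_model.wv['whale']
print(whale_vector.shape)  # (200,)

# Find the five words whose vectors are closest to 'whale'
print(cbow_model.wv.most_similar('whale', topn=5))

# Save the trained model to disk and reload it later (placeholder file name)
cbow_model.save("cbow_moby.model")
reloaded_model = Word2Vec.load("cbow_moby.model")
print(reloaded_model.wv.similarity('whale', 'sea'))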
Conclusion
We took a close look at Word2Vec and its usage with the gensim library, exploring how it works. By training Word2Vec models on textual data, we can measure semantic similarities between words and support downstream NLP tasks such as sentiment analysis and text classification. As shown, Word2Vec provides a practical path to richer natural language processing, allowing us to uncover the semantic relationships hidden in unstructured text data.