In this shot, we will build an NLP engine that picks the odd word out from a set of words. For example, given the words “Apple”, “Mango”, “Party”, and “Juice”, it is clear that “Party” is the odd one out.
For this, we are going to use Gensim’s word2vec model. Gensim provides an efficient implementation of word2vec.
Before moving on, you need to download the pretrained word2vec vectors. Remember that the file size is ~1.5 GB. Because the file is so large, we suggest you work on Google Colab.
Open your Google Colab and run the command below to get your word vectors.
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
This command downloads the file directly to the Colab machine on Google's servers, which saves a lot of time.
Now, let’s install the packages we require, as shown below.
pip install gensim
pip install scikit-learn
You can run the command above in both Google Colab and on your local machine (if you’re using that).
Now, let’s move on to the coding part by first importing the packages in the following way:
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

print('Imported Successfully!')
We imported two packages that will be used in the following way:
The gensim package will be used to load the word vectors that we downloaded.
KeyedVectors essentially contains the mapping between words and their embeddings. After training, it can be used to directly query those embeddings in various ways.
scikit-learn's cosine_similarity will be used to measure how similar two word vectors are. This similarity metric is commonly used and provides good results for various problems.
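To get a feel for cosine similarity before using it on real embeddings, here is a minimal, dependency-free sketch of the formula: the dot product of two vectors divided by the product of their lengths. The `cosine` helper and the 2-D vectors are made up for illustration; in the rest of this shot we rely on scikit-learn's `cosine_similarity` instead.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-D vectors, purely for illustration
same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
perpendicular = cosine([1.0, 0.0], [0.0, 3.0])    # orthogonal vectors
opposite = cosine([1.0, 1.0], [-1.0, -1.0])       # vectors pointing opposite ways

print(same_direction, perpendicular, opposite)
```

The three results are approximately 1, 0, and -1 (up to floating-point rounding). The value depends only on the angle between the vectors, not on their lengths, and it never exceeds 1.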
Next, we are going to create a function that takes a list of words and returns the word that is least similar to the rest. This function can be created as shown below.
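import numpy as np  # used to average the word vectors

# Accepts a list of words and the word2vec vectors
def odd_one_out(words, word_vectors):
    # Look up the embedding of every word
    all_word_vectors = [word_vectors[w] for w in words]
    # Average the embeddings to get one vector that represents the group
    avg_vector = np.mean(all_word_vectors, axis=0)

    odd_word = None
    min_sim = 1.0  # cosine similarity is at most 1.0
    for w in words:
        # cosine_similarity returns a 1x1 array here, so take the scalar
        sim = cosine_similarity([word_vectors[w]], [avg_vector])[0][0]
        if sim < min_sim:
            min_sim = sim
            odd_word = w
    return odd_word

print("Function Created Successfully!")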
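The explanation for the code above is:
We look up the vector of every word in the list and take the average of these vectors.
We initialize min_sim to 1.0, which is the maximum value cosine similarity can take. We then compare each word's similarity to the average vector against this value and keep any value that is smaller.
The word with the smallest similarity gives the desired result.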
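If you want to try the idea before loading the 1.5 GB file, the steps above can be sketched on a few hand-made vectors. Everything in this sketch is hypothetical: the 3-D values in `toy_vectors` are invented (real word2vec vectors have 300 dimensions), and `cosine` and `toy_odd_one_out` are stand-in names that use plain Python instead of NumPy and scikit-learn.

```python
import math

# Invented 3-D "embeddings": the food words point in a similar direction
toy_vectors = {
    "apple": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "juice": [0.7, 0.3, 0.0],
    "party": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_odd_one_out(words, vectors):
    dims = len(vectors[words[0]])
    # Component-wise average of all the word vectors
    avg = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    # The word whose vector is least similar to the average
    return min(words, key=lambda w: cosine(vectors[w], avg))

result = toy_odd_one_out(["apple", "mango", "party", "juice"], toy_vectors)
print(result)  # party
```

Here `min` with a `key` function replaces the explicit loop and the `min_sim` bookkeeping, but the logic is the same: average the vectors, then pick the word that is least similar to that average.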
Now that we have our function ready, let’s use it with some inputs.
word_vectors = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

list_of_words = ["apple", "mango", "party", "juice", "orange"]
print(odd_one_out(list_of_words, word_vectors))
What do you think the output is going to be?
The correct answer is party, because all of the other words are food items and somewhat related to each other. Thus, our system was able to correctly find the odd word out from the given set of words.
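One caveat: looking up a word that is not in the vocabulary raises a KeyError, so a typo in the input list would stop the function. A minimal sketch of one way to guard against this is to separate known from unknown words first. The `split_known` helper is hypothetical and assumes the loaded object supports `in` membership tests, as gensim's KeyedVectors does; a plain dict stands in for it here so the sketch runs without the download.

```python
def split_known(words, word_vectors):
    """Separate words that have a vector from words that do not."""
    known = [w for w in words if w in word_vectors]
    unknown = [w for w in words if w not in word_vectors]
    return known, unknown

# A plain dict stands in for the loaded KeyedVectors object (assumption)
stand_in_vectors = {"apple": [0.9, 0.1], "mango": [0.8, 0.2], "party": [0.1, 0.9]}

known, unknown = split_known(["apple", "mango", "partyy", "party"], stand_in_vectors)
print(known)    # ['apple', 'mango', 'party']
print(unknown)  # ['partyy']
```

You could then pass only the known words to the odd-one-out function and report the unknown ones to the user.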