Introduction to Word Embedding
What is a word embedding?
If you have worked with NLP, you have probably created vectors for text, i.e., converted textual data into numbers, using the two most common techniques: TF-IDF (Term Frequency-Inverse Document Frequency) and CountVectorizer. Let’s look closely at these two techniques.
TF-IDF
TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. The TF-IDF score for a word in a document is calculated by multiplying two different metrics:
- The term frequency of a word in a document. There are several ways of calculating this frequency, the simplest being a raw count of the number of times a word appears in a document. There are also ways to adjust the frequency: by the length of the document, or by the raw frequency of the most frequent word in the document.
- The inverse document frequency of the word across a set of documents. This refers to how common or rare a word is in the entire document set. The closer it is to 0, the more common the word is. This metric is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm.
- So, if the word is very common and appears in many documents, this number will approach 0; if the word is rare, the number will be large.
- Multiplying these two numbers results in the TF-IDF score of a word in a document. The higher the score, the more relevant that word is in that particular document.
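Putting the two metrics together, here is a minimal pure-Python sketch of the calculation. It uses the simplest variants described above: a length-adjusted raw count for term frequency and plain log(total documents / documents containing the word) for inverse document frequency; library implementations such as scikit-learn's TfidfVectorizer use smoothed variants of these formulas.

```python
import math

# Toy corpus: each document is a list of lowercase tokens.
docs = [
    ["apple", "is", "a", "tasty", "fruit"],
    ["apple", "released", "a", "new", "phone"],
    ["the", "fruit", "market", "is", "busy"],
]

def tf(word, doc):
    # Raw count of the word, adjusted by document length.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(total documents / documents containing the word).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "a" appears in two of the three documents, so its IDF is low;
# "tasty" appears in only one, so it scores higher in that document.
print(tf_idf("a", docs[0], docs))
print(tf_idf("tasty", docs[0], docs))
```

Note how the common word "a" receives a lower score than the rarer word "tasty", even though both appear once in the first document.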
CountVectorizer
This technique converts a collection of text documents into a matrix of token counts. This means that each text is converted into a vector containing the number of times each word appears in the sentence.
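The idea can be sketched in a few lines of pure Python (scikit-learn's CountVectorizer adds tokenization options, sparse storage, and more; the vocabulary building and word splitting here are deliberately simplified):

```python
# Build a count matrix by hand: one row per sentence,
# one column per vocabulary word.
sentences = [
    "apple is a tasty fruit",
    "apple is a big company",
]

# Vocabulary: every distinct word, sorted so the columns are stable.
vocab = sorted({w for s in sentences for w in s.split()})

# Each sentence becomes a vector of word counts over the vocabulary.
matrix = [[s.split().count(w) for w in vocab] for s in sentences]

print(vocab)
for row in matrix:
    print(row)
```

Each row is the count vector for one sentence; words absent from a sentence simply get a count of 0 in that row.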
But as we discussed in our chapter on transfer learning, there are pre-trained models for NLP tasks that can be used to generate word vectors. These word vectors are nothing but word embeddings. The two most popular word embeddings are Word2Vec and GloVe.
In this chapter, we will use Word2Vec embeddings to create two mini projects, and, in the next chapter, we will use GloVe embeddings to build a sentiment analysis model.
Why are word embeddings required?
Humans can deal with textual data quite intuitively, but millions of text documents are generated every single day, and we cannot have humans perform all the text processing tasks. So how do we make computers perform clustering, classification, etc., on text data? As we have seen, all our models work with numeric data.
A computer can match two strings and tell whether they are the same or not. But how do we make computers understand that the USA and Donald Trump are related? How do you make a computer understand that “Apple” in “Apple is a tasty fruit” is a fruit and not a company?
To make computers understand all these things, we need to create a representation of words that capture:
- Their meanings,
- Their semantic relationships, and
- The different types of contexts they are used in.
This is where word embedding comes into the picture.
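As a sketch of what such a representation buys us, suppose we already have embedding vectors for a few words. The numbers below are made up for illustration, not taken from a real Word2Vec or GloVe model; the point is that cosine similarity between embedding vectors approximates semantic relatedness.

```python
import math

# Hypothetical 4-dimensional embeddings (illustrative values only).
embeddings = {
    "apple":   [0.9, 0.1, 0.7, 0.2],
    "fruit":   [0.8, 0.2, 0.6, 0.3],
    "company": [0.1, 0.9, 0.2, 0.8],
}

def cosine(u, v):
    # Cosine of the angle between two vectors: the dot product
    # divided by the product of their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# In this toy space, "apple" sits closer to "fruit" than to "company".
print(cosine(embeddings["apple"], embeddings["fruit"]))
print(cosine(embeddings["apple"], embeddings["company"]))
```

With real pre-trained embeddings, the same similarity computation is what lets a model notice that "Apple" in "Apple is a tasty fruit" behaves like a fruit word rather than a company name.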