ELMo: Taking Ambiguities Out of Word Vectors

Learn about the ELMo representation algorithm, and find ambiguities in word vectors.

Limitations of vanilla word embeddings

So far, we’ve looked at word embedding algorithms that assign a single representation to each word in the vocabulary. They return the same vector for a given word no matter how many times we query it or in what context the word appears. Why is this a problem? Consider the following two phrases:

I went to the bank to deposit some money


I walked along the river bank

Clearly, the word “bank” is used in two entirely different contexts. A vanilla word vector algorithm (e.g., skip-gram) can only have a single representation for the word “bank,” and that representation will likely be muddled between the concept of a financial institution and the concept of the walkable edge of a river, depending on how the word appears in the training corpus. It’s therefore more sensible to generate a word’s embedding while preserving and leveraging the context around it. This is exactly what ELMo strives to do.
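The limitation is easy to demonstrate with a toy static embedding table (the words and vectors below are made up purely for illustration): every occurrence of a word maps to the same vector, so the two senses of “bank” are indistinguishable.

```python
# Hypothetical static embedding table: one fixed vector per word,
# regardless of the sentence it appears in.
static_embeddings = {
    "bank": [0.21, -0.53, 0.77],   # a single vector blending both senses
    "river": [0.05, 0.61, -0.33],
    "money": [0.48, -0.12, 0.09],
}

def embed(sentence):
    # Every occurrence of a word hits the same table entry.
    return [static_embeddings.get(w, [0.0, 0.0, 0.0]) for w in sentence.split()]

v_financial = embed("bank to deposit money")[0]  # "bank" near "money"
v_river = embed("river bank")[1]                 # "bank" near "river"
assert v_financial == v_river  # identical vectors, despite different contexts
```

A contextualized model like ELMo, by contrast, computes the vector for “bank” from the whole input sequence, so the two occurrences receive different representations.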

ELMo: Contextualized word embeddings

Specifically, ELMo takes in a sequence of tokens, as opposed to a single token, and produces a contextualized representation for each token in the sequence. The figure below depicts the various components that make up the model. The first thing to understand is that ELMo is a complicated beast! Several neural networks work in concert inside ELMo to produce its output. In particular, the model uses:

  • A character embedding layer (an embedding vector for each character).

  • A convolutional neural network (CNN), which consists of many convolutional layers followed by an optional fully connected classification layer. A convolution layer takes in a sequence of inputs (e.g., a sequence of characters in a word) and moves a window of weights over the input to generate a latent representation.

  • Two bidirectional LSTM layers. An LSTM is a type of neural network used to process time-series data. Given a sequence of inputs (e.g., a sequence of word vectors), an LSTM moves along the time dimension from one input to the next, producing an output at each position. Unlike fully connected networks, LSTMs have memory: the output at the current position is affected by what the LSTM has seen in the past.
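The pipeline above can be sketched in a few lines of NumPy. This is a heavily simplified illustration, not ELMo’s actual architecture: the dimensions are toy values, the weights are random, and a plain recurrent step stands in for the LSTM (and only in one direction). What it shows is the flow: character embeddings, a window of convolution weights sliding over them, and a recurrent state that carries context from earlier tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
char_dim, conv_width, n_filters, hidden = 4, 3, 5, 6

# 1) Character embedding layer: one vector per character.
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char_vocab), char_dim))

# Convolution filters: a window of weights moved over the character sequence.
filters = rng.normal(size=(conv_width * char_dim, n_filters))

def char_cnn(word):
    """Slide a window over the word's character vectors, then max-pool
    over positions to get one fixed-size vector per word."""
    chars = np.stack([char_emb[char_vocab[c]] for c in word])  # (len, char_dim)
    windows = [chars[i:i + conv_width].ravel()
               for i in range(len(chars) - conv_width + 1)]
    feats = np.stack(windows) @ filters          # (positions, n_filters)
    return np.tanh(feats).max(axis=0)            # (n_filters,)

# 2) A minimal recurrent step standing in for the LSTM: the hidden state
# carries information from earlier tokens, so each output depends on the past.
W_in = rng.normal(size=(n_filters, hidden))
W_rec = rng.normal(size=(hidden, hidden))

def run_forward(words):
    state, outputs = np.zeros(hidden), []
    for w in words:
        x = char_cnn(w)
        state = np.tanh(x @ W_in + state @ W_rec)  # memory of previous tokens
        outputs.append(state)
    return outputs

# The same word "bank" now gets different vectors in different contexts,
# because the state reflects the tokens that came before it.
out_a = run_forward(["deposit", "money", "bank"])[-1]
out_b = run_forward(["river", "bank"])[-1]
print(np.allclose(out_a, out_b))  # different representations for "bank"
```

A real LSTM adds gating to control what is remembered and forgotten, and ELMo runs two such layers in both directions, but the contextualizing effect shown here is the essential idea.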
