Transformer Architecture: Embedding Layers

Learn about the embedding layers in the transformer.

Word embeddings provide a semantics-preserving representation of words based on the context in which words are used. In other words, if two words are used in similar contexts, they will have similar word vectors. For example, the words “cat” and “dog” will have similar representations, whereas “cat” and “volcano” will have vastly different representations.

Word vectors were initially introduced in the paper titled Efficient Estimation of Word Representations in Vector Space by Mikolov et al. The approach came in two variants: skip-gram and continuous bag-of-words. Embeddings work by first defining a large matrix of size $V \times E$, where $V$ is the size of the vocabulary and $E$ is the size of the embeddings. Each row of the matrix is the vector for one word in the vocabulary. $E$ is a user-defined hyperparameter; a larger $E$ typically leads to more powerful word embeddings. In practice, for these word vector algorithms, there is usually little benefit in increasing the embedding size beyond about 300.
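To make the $V \times E$ matrix concrete, here is a minimal sketch in plain Python. The three-word vocabulary, the embedding size of 4, and the helper names `make_embedding_matrix` and `embed` are assumptions chosen for illustration; they do not come from the paper.

```python
import random

def make_embedding_matrix(vocab_size, embed_size, seed=0):
    """Return a V x E matrix (a list of V rows) filled with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embed_size)]
            for _ in range(vocab_size)]

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocab = {"cat": 0, "dog": 1, "volcano": 2}
V, E = len(vocab), 4  # E is the user-chosen embedding size
embedding_matrix = make_embedding_matrix(V, E)

def embed(word):
    """Embedding lookup: return the matrix row that belongs to `word`."""
    return embedding_matrix[vocab[word]]

cat_vector = embed("cat")  # a list of E floats
```

Because the matrix is randomly initialized here, the vectors for “cat” and “dog” are not yet similar. Similarity only emerges after the matrix has been trained.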

General approach for word embeddings

Motivated by the original word vector algorithms, modern deep learning models use embedding layers to represent words and tokens. The following general approach (sometimes combined with pretraining, after which the embeddings are fine-tuned) is taken to incorporate word embeddings into a machine learning model:

  • Define a randomly initialized word embedding matrix (or load pretrained embeddings, many of which are freely available to download).

  • Define the model (randomly initialized) that uses word embeddings as the inputs and produces an output (for example, sentiment or a language translation).

  • Train the whole model (embeddings and the model) end to end on the task.
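The three steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production recipe: the four-sentence sentiment dataset, the averaged-embedding logistic model, the learning rate, and the names `forward` and `total_loss` are all hypothetical choices made to keep the example short.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical toy dataset: (tokens, label), 1 = positive, 0 = negative.
data = [
    (["good", "movie"], 1),
    (["great", "film"], 1),
    (["bad", "movie"], 0),
    (["awful", "film"], 0),
]
words = sorted({tok for tokens, _ in data for tok in tokens})
vocab = {tok: i for i, tok in enumerate(words)}
E = 8  # embedding size, a user-chosen hyperparameter

# Step 1: randomly initialized V x E embedding matrix.
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(E)] for _ in words]
# Step 2: randomly initialized model (here, a single logistic unit).
weights = [rng.uniform(-0.1, 0.1) for _ in range(E)]
bias = 0.0

def forward(tokens):
    """Average the token embeddings, then apply logistic regression."""
    ids = [vocab[t] for t in tokens]
    avg = [sum(embeddings[i][k] for i in ids) / len(ids) for k in range(E)]
    z = sum(w * a for w, a in zip(weights, avg)) + bias
    return 1.0 / (1.0 + math.exp(-z)), ids, avg

def total_loss():
    """Binary cross-entropy summed over the toy dataset."""
    loss = 0.0
    for tokens, label in data:
        p, _, _ = forward(tokens)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return loss

# Step 3: train embeddings and model together (end to end) with plain SGD.
initial_loss = total_loss()
lr = 0.2
for _ in range(300):
    for tokens, label in data:
        p, ids, avg = forward(tokens)
        err = p - label  # gradient of the loss with respect to the logit
        grad_avg = [err * w for w in weights]  # taken before weights change
        for k in range(E):
            weights[k] -= lr * err * avg[k]
        bias -= lr * err
        for i in ids:  # the gradient flows back into the embedding rows
            for k in range(E):
                embeddings[i][k] -= lr * grad_avg[k] / len(ids)
final_loss = total_loss()
```

The point to notice is in the last loop: the same error signal updates both the model weights and the embedding rows, which is what training end to end means here.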

Embeddings in transformer models

The same technique is used in transformer models. However, in transformer models, there are two different embeddings:

  • Token embeddings provide a unique representation for each token seen by the model in an input sequence.

  • Positional embeddings provide a unique representation for each position in the input sequence.

The token embeddings have a unique embedding vector for each token (such as a character, word, or subword), depending on the model’s tokenizing mechanism.
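As a rough illustration of how the tokenizing mechanism decides which units receive an embedding vector, the sketch below builds a vocabulary from the same text at word level and at character level. The helper `build_vocab` and the sample text are assumptions made for this example; subword tokenizers are omitted because they need a learned merge table.

```python
text = "ralph loves his tennis ball"

def build_vocab(tokens):
    """Assign one integer ID (one embedding row) to each distinct token."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

word_tokens = text.split()  # word-level tokenization
char_tokens = list(text)    # character-level tokenization (spaces included)

word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

# The IDs are the row indices used to look up token embeddings.
word_ids = [word_vocab[t] for t in word_tokens]
char_ids = [char_vocab[t] for t in char_tokens]
```

The word-level tokenizer yields a short ID sequence over a vocabulary that grows with the number of distinct words, while the character-level tokenizer yields a longer sequence over a small, fixed set of symbols.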

The positional embeddings serve to tell the transformer model where in the sequence each token appears. This is needed because, unlike LSTMs/GRUs, transformer models have no built-in notion of sequence order: they process the whole text in one go. Furthermore, a change to the position of a word can alter the meaning of a sentence or of a word. For example:

Ralph loves his tennis ball. It likes to chase the ball.

Ralph loves his tennis ball. Ralph likes to chase it.

In the sentences above, the word “it” refers to different things, and the position of the word “it” can be used as a cue to identify this difference. The original transformer paper generates positional embeddings with fixed sine and cosine functions of the position.
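In that paper (Vaswani et al., Attention Is All You Need), position $pos$ and dimension index $i$ are mapped to sinusoids as follows, where $d_{\text{model}}$ is the embedding size:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

A minimal plain-Python sketch of these equations follows. The function name `positional_encoding`, the sizes used, and the placeholder token vectors are assumptions for illustration. Adding the positional vector to the token vector follows the original paper; other models may combine them differently.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # index of the sine/cosine pair this dimension belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(max_len=10, d_model=8)

# Hypothetical placeholder token embeddings, one vector per position.
token_vectors = [[0.1] * 8 for _ in range(10)]

# Element-wise sum of token and positional vectors, as in the original paper.
model_input = [[t + p for t, p in zip(tok, pos_vec)]
               for tok, pos_vec in zip(token_vectors, pe)]
```

Every position gets a distinct vector, so two occurrences of the same token at different positions reach the model as different inputs.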
