
Transformer Architecture: Embedding Layers

Learn how transformer architectures utilize embedding layers to represent words and their positions within sequences. This lesson explains token embeddings and positional embeddings, the mathematical basis for positional encodings, and how these embeddings combine to enable transformers to understand word context and order.

Word embeddings provide a semantic-preserving representation of words based on the context in which the words are used. In other words, if two words appear in similar contexts, they will have similar word vectors. For example, the words “cat” and “dog” will have similar representations, whereas “cat” and “volcano” will have vastly different representations.
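As a quick illustration of this idea, the sketch below compares a few hand-picked toy vectors with cosine similarity. The vectors and values are made up for the example; real embeddings are learned from data and have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional vectors chosen by hand purely to illustrate the idea.
embeddings = {
    "cat":     np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":     np.array([0.8, 0.9, 0.2, 0.1]),
    "volcano": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))      # high (~0.99)
print(cosine_similarity(embeddings["cat"], embeddings["volcano"]))  # low  (~0.12)
```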

Word vectors were initially introduced in the paper Efficient Estimation of Word Representations in Vector Space by Mikolov et al. (https://arxiv.org/pdf/1301.3781.pdf), which proposed two variants: skip-gram and continuous bag-of-words. Embeddings work by first defining a large matrix of size $V \times E$, where $V$ is the size of the vocabulary and $E$ is the dimensionality of the embeddings.
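A minimal sketch of this lookup-table view is shown below, assuming a small, randomly initialized matrix; in practice the matrix is learned during training, and the specific sizes and token ids here are illustrative only.

```python
import numpy as np

# Hypothetical sizes: V = vocabulary size, E = embedding dimension.
V, E = 10_000, 300
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.02, size=(V, E))  # the V x E matrix

# Each token in a sequence is an integer id; looking up its row in the
# matrix yields its E-dimensional embedding vector.
token_ids = np.array([42, 7, 1337])             # made-up ids for a short sequence
token_embeddings = embedding_matrix[token_ids]  # one row per token
print(token_embeddings.shape)                   # (3, 300)
```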