Word and Positional Embeddings
Explore how token IDs become continuous vectors through embeddings and how positional encodings provide sequence order information. Understand the trade-offs of sinusoidal and learned positional embeddings and their role in transformers, building a foundation for grasping attention mechanisms.
Every neural network, including the massive language models behind modern AI assistants, operates on continuous numbers. It cannot process a raw word like “cat” or an integer token ID like 3797 directly. Before a transformer can reason about language, each token must be converted into a dense numerical vector that the model’s layers can multiply, add, and transform. This conversion, along with the injection of sequence-order information, forms the critical bridge between raw text and the transformer’s internal computations. Understanding this bridge is essential because every downstream operation, from attention to generation, depends entirely on the quality of these input representations.
From token IDs to continuous vectors
In the previous lesson, you saw how subword tokenizers such as BPE and WordPiece break raw text into discrete integer IDs. The raw text “The cat is learning” would be split into subword tokens [“The”, “cat”, “is”, “learn”, “ing”], which might then be assigned integer IDs like [4532, 281, 312, 1543, 5]. These integers are convenient labels, but they carry no information about meaning or relationships. The number 281 is not “closer” to 312 in any semantic sense. Neural networks need continuous, high-dimensional vectors where arithmetic operations like addition and dot products correspond to meaningful relationships.
The embedding matrix as a lookup table
The solution is an embedding matrix: a large table of learnable parameters with one row for every entry in the vocabulary. If the vocabulary contains V tokens and each embedding has d dimensions, the matrix has shape V × d. Converting a token ID into a vector is then a simple lookup: ID 281 selects row 281, and that row becomes the token’s dense representation. Because the rows are trained together with the rest of the network, tokens that occur in similar contexts end up with similar vectors, giving the model the semantic geometry that raw integer IDs lack.
Typical embedding dimensions vary by model scale. BERT-base uses 768 dimensions, while the largest GPT-3 model uses 12,288 (GPT-4’s dimensions have not been publicly disclosed). The embedding matrix is often the single largest block of parameters tied directly to vocabulary size, which connects back to the vocab-size trade-offs discussed in the tokenization lesson. A larger vocabulary means more rows and more parameters, even before the transformer layers begin.
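To make the lookup concrete, here is a minimal PyTorch sketch using nn.Embedding. The vocabulary size, embedding dimension, and token IDs below are illustrative placeholders, not values from a real tokenizer or model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; production models use far larger vocabularies
vocab_size = 8000   # number of rows in the embedding matrix
d_model = 768       # width of each embedding vector (the BERT-base dimension)

# The embedding matrix: one learnable d_model-dimensional row per token ID
token_embedding = nn.Embedding(vocab_size, d_model)

# Token IDs from the running example (illustrative values)
token_ids = torch.tensor([[4532, 281, 312, 1543, 5]])  # shape: (batch=1, seq_len=5)

# The lookup: each ID selects its row, producing one dense vector per token
embedded = token_embedding(token_ids)
print(embedded.shape)  # torch.Size([1, 5, 768])

# Parameter count is vocab_size * d_model, the vocabulary-size trade-off in action
print(sum(p.numel() for p in token_embedding.parameters()))  # 8000 * 768 = 6,144,000
```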
Note: Amazon SageMaker’s Object2Vec algorithm provides an AWS-native approach for learning embeddings that capture semantic relationships. Unlike classic Word2Vec, which handles only individual words, Object2Vec can learn embeddings for diverse object types including sequences and pairs of objects.
The following diagram illustrates how the lookup operation extracts one vector per token from the embedding matrix.
With token embeddings in hand, the model has dense vectors that encode semantic content. But there is a critical piece of information still missing: where each token appears in the sequence.
Why position matters for transformers
Recurrent neural networks process tokens one at a time in strict left-to-right order, so position information is baked into the computation implicitly. Transformers take a fundamentally different approach: self-attention looks at all tokens in parallel and treats the input as an unordered set. Unless position is injected explicitly, the model has no way of knowing which token came first.
Consider two sentences with very different meanings:
“The dog chased the cat.”
“The cat chased the dog.”
Without positional information, self-attention computes pairwise scores using only token content. The same set of tokens appears in both sentences, so the model would produce identical representations for both. The subject-object relationship would be invisible.
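To see this concretely, the sketch below runs a stripped-down self-attention (no learned query/key/value projections) over randomly initialized toy embeddings; the tiny vocabulary and dimensions are hypothetical. Swapping “dog” and “cat” only permutes the output vectors, so each token’s representation is identical in both orderings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16

# Toy embedding table for a four-word vocabulary (hypothetical values)
vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3}
E = torch.randn(len(vocab), d_model)

def self_attention(x):
    # Scaled dot-product self-attention with no positional signal
    scores = x @ x.T / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ x

ids_a = torch.tensor([vocab[w] for w in ["the", "dog", "chased", "the", "cat"]])
ids_b = torch.tensor([vocab[w] for w in ["the", "cat", "chased", "the", "dog"]])

out_a = self_attention(E[ids_a])
out_b = self_attention(E[ids_b])

# The vector computed for "dog" (and for every other token) is the same in both
# sentences; only its position in the output changes. Word order is invisible.
print(torch.allclose(out_a[1], out_b[4]))  # "dog": position 1 vs position 4 -> True
print(torch.allclose(out_a[4], out_b[1]))  # "cat": position 4 vs position 1 -> True
```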
This means we must add a positional signal to each token embedding before it enters the transformer stack. Two dominant strategies have emerged for this purpose: sinusoidal (fixed) positional encodings from the original “Attention Is All You Need” paper, and learned positional embeddings used by GPT-2, BERT, and most modern LLMs. ...
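As a preview of the first strategy, here is a minimal sketch of the sinusoidal encoding from the original paper: even dimensions use sines and odd dimensions use cosines at geometrically spaced frequencies, and the result is simply added to the token embeddings. The sequence length and the random token embeddings below stand in for the lookup shown earlier.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

# The learned alternative (GPT-2/BERT style) is simply another embedding table
# indexed by position: nn.Embedding(max_seq_len, d_model).
seq_len, d_model = 5, 768
pe = sinusoidal_positional_encoding(seq_len, d_model)

# The positional signal is added to the token embeddings before the first layer
token_embeddings = torch.randn(seq_len, d_model)  # placeholder for the lookup above
model_input = token_embeddings + pe
print(model_input.shape)  # torch.Size([5, 768])
```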