Word and Positional Embeddings
Explore how token IDs become continuous vectors through embeddings and how positional encodings provide sequence order information. Understand the trade-offs of sinusoidal and learned positional embeddings and their role in transformers, building a foundation for grasping attention mechanisms.
Every neural network, including the massive language models behind modern AI assistants, operates on continuous numbers. It cannot process a raw word like “cat” or an integer token ID like 3797 directly. Before a transformer can reason about language, each token must be converted into a dense numerical vector that the model’s layers can multiply, add, and transform. This conversion, along with the injection of sequence-order information, forms the critical bridge between raw text and the transformer’s internal computations. Understanding this bridge is essential because every downstream operation, from attention to generation, depends entirely on the quality of these input representations.
From token IDs to continuous vectors
In the previous lesson, you saw how subword tokenizers such as BPE and WordPiece break raw text into discrete integer IDs. The raw text “The cat is learning” would be split into subword tokens [“The”, “cat”, “is”, “learn”, “ing”], which might then be assigned integer IDs like [4532, 281, 312, 1543, 5]. These integers are convenient labels, but they carry no information about meaning or relationships. The number 281 is not “closer” to 312 in any semantic sense. Neural networks need continuous, high-dimensional vectors where arithmetic operations like addition and dot products correspond to meaningful relationships.
The embedding matrix as a lookup table
The solution is an embedding matrix: a large table of learnable parameters with one row for every entry in the vocabulary. If the vocabulary contains V tokens and each embedding has d dimensions, the matrix has shape V × d. Converting a token ID into a vector is then a simple lookup: ID 281 selects row 281, and that row becomes the token’s dense representation. Because the rows are trained together with the rest of the network, tokens that occur in similar contexts end up with similar vectors, giving the model the semantic geometry that raw integer IDs lack.
Typical embedding dimensions vary by model scale. BERT-base uses 768 dimensions, while the largest GPT-3 model uses 12,288 (GPT-4’s dimensions have not been publicly disclosed). The embedding matrix is often the single largest block of parameters tied directly to vocabulary size, which connects back to the vocab-size trade-offs discussed in the tokenization lesson. A larger vocabulary means more rows and more parameters, even before the transformer layers begin.
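To make the lookup concrete, here is a minimal PyTorch sketch using nn.Embedding. The vocabulary size, embedding dimension, and token IDs below are illustrative placeholders, not values from a real tokenizer or model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; production models use far larger vocabularies
vocab_size = 8000   # number of rows in the embedding matrix
d_model = 768       # width of each embedding vector (the BERT-base dimension)

# The embedding matrix: one learnable d_model-dimensional row per token ID
token_embedding = nn.Embedding(vocab_size, d_model)

# Token IDs from the running example (illustrative values)
token_ids = torch.tensor([[4532, 281, 312, 1543, 5]])  # shape: (batch=1, seq_len=5)

# The lookup: each ID selects its row, producing one dense vector per token
embedded = token_embedding(token_ids)
print(embedded.shape)  # torch.Size([1, 5, 768])

# Parameter count is vocab_size * d_model, the vocabulary-size trade-off in action
print(sum(p.numel() for p in token_embedding.parameters()))  # 8000 * 768 = 6,144,000
```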
Note: Amazon SageMaker’s Object2Vec algorithm provides an AWS-native approach for learning embeddings that capture semantic relationships. Unlike classic Word2Vec, which handles only individual words, Object2Vec can learn embeddings for diverse object types including sequences and pairs of objects.
The following diagram illustrates how the lookup operation extracts one vector per token from the embedding matrix.
With token embeddings in hand, the model has dense vectors that encode semantic content. But there is a critical piece of information still missing: where each token appears in the sequence.
Why position matters for transformers
Recurrent neural networks process tokens one at a time in strict left-to-right order, so position information is baked into the computation implicitly. Transformers take a fundamentally different approach: self-attention looks at all tokens in parallel and treats the input as an unordered set. Unless position is injected explicitly, the model has no way of knowing which token came first.
Consider two sentences with very different meanings:
“The dog chased the cat.”
“The cat chased the dog.”
Without positional information, self-attention computes pairwise scores using only token content. The same set of tokens appears in both sentences, so the model would produce identical representations for both. The subject-object relationship would be invisible.
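To see this concretely, the sketch below runs a stripped-down self-attention (no learned query/key/value projections) over randomly initialized toy embeddings; the tiny vocabulary and dimensions are hypothetical. Swapping “dog” and “cat” only permutes the output vectors, so each token’s representation is identical in both orderings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16

# Toy embedding table for a four-word vocabulary (hypothetical values)
vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3}
E = torch.randn(len(vocab), d_model)

def self_attention(x):
    # Scaled dot-product self-attention with no positional signal
    scores = x @ x.T / d_model ** 0.5
    return F.softmax(scores, dim=-1) @ x

ids_a = torch.tensor([vocab[w] for w in ["the", "dog", "chased", "the", "cat"]])
ids_b = torch.tensor([vocab[w] for w in ["the", "cat", "chased", "the", "dog"]])

out_a = self_attention(E[ids_a])
out_b = self_attention(E[ids_b])

# The vector computed for "dog" (and for every other token) is the same in both
# sentences; only its position in the output changes. Word order is invisible.
print(torch.allclose(out_a[1], out_b[4]))  # "dog": position 1 vs position 4 -> True
print(torch.allclose(out_a[4], out_b[1]))  # "cat": position 4 vs position 1 -> True
```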
This means we must add a positional signal to each token embedding before it enters the transformer stack. Two dominant strategies have emerged for this purpose: sinusoidal (fixed) positional encodings from the original “Attention Is All You Need” paper, and learned positional embeddings used by GPT-2, BERT, and most modern LLMs. ...
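As a preview of the first strategy, here is a minimal sketch of the sinusoidal encoding from the original paper: even dimensions use sines and odd dimensions use cosines at geometrically spaced frequencies, and the result is simply added to the token embeddings. The sequence length and the random token embeddings below stand in for the lookup shown earlier.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

# The learned alternative (GPT-2/BERT style) is simply another embedding table
# indexed by position: nn.Embedding(max_seq_len, d_model).
seq_len, d_model = 5, 768
pe = sinusoidal_positional_encoding(seq_len, d_model)

# The positional signal is added to the token embeddings before the first layer
token_embeddings = torch.randn(seq_len, d_model)  # placeholder for the lookup above
model_input = token_embeddings + pe
print(model_input.shape)  # torch.Size([5, 768])
```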