Positional Encodings and Layer Normalization
Learn how positional information is added to the embedding matrix before it is sent to the transformer for processing.
In our last lesson, we successfully transformed our prompt into a matrix of semantic embedding vectors. We now have a tangible object, a matrix of shape (number_of_tokens, embedding_dimension), that holds the meaning of our words.
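To make that shape concrete, here is a minimal sketch in NumPy. The four-token prompt and eight-dimensional embedding space are toy assumptions, and the random values stand in for the learned embeddings a real model would look up in its embedding table.

```python
import numpy as np

# Toy stand-in for the output of the previous lesson: one embedding
# vector per token, stacked into a single matrix. The random values
# are placeholders for learned embeddings.
number_of_tokens = 4        # e.g. a four-token prompt
embedding_dimension = 8     # real models use hundreds or thousands

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(number_of_tokens, embedding_dimension))

print(embedding_matrix.shape)   # (4, 8) -> (number_of_tokens, embedding_dimension)
```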
However, this matrix, as powerful as it is, presents us with a profound paradox. The transformer architecture, which we will explore later on, is designed for blazing-fast parallel computation: it processes every single token in our prompt simultaneously. That speed comes at a cost. By abandoning the one-word-at-a-time processing of older models, the transformer loses any inherent sense of word order. To it, our matrix is just a “bag of words.”
How can a model that is so good at language not know the difference between “dog chases cat” and “cat chases dog”? In this lesson, we will solve this fundamental problem and prepare our matrix for the intense computation to come.
The paradox of parallelism
Before the transformer, models like Recurrent Neural Networks (RNNs) processed language sequentially, like an assembly line. They would read “dog,” update their internal memory, then read “bites,” and so on. Order was baked into the very structure of their computation, but this made them slow.
The transformer’s creators made a radical bet: what if we could process every word at once? Think of it like a team of workers in a room, all given their assignments simultaneously. This is massively parallel and incredibly fast. But it creates a new problem: how does the worker with the “dog” vector know they are first? How does the worker with the “man” vector know they are last?
The elegant solution to this self-inflicted problem is positional encodings. This is the invention that makes the transformer’s bet on parallelism possible. We create a second matrix of the exact same size as our embedding matrix, but this one contains unique vectors, or “color stripes,” that signal the position of each token. We then add this matrix directly to our embedding matrix.
The result is that the vector for “dog” at the start of a sentence becomes fundamentally different from the vector for “dog” at the end. The positional signal is now embedded in every single vector, providing the model with the crucial information it needs to be order-aware. ...
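To see this addition in code, here is a minimal sketch using the sinusoidal encodings from the original transformer paper, “Attention Is All You Need.” Treat it as one illustrative scheme rather than this lesson’s definitive recipe, and note that the toy shapes and random embedding values below are placeholder assumptions.

```python
import numpy as np

def sinusoidal_positional_encodings(num_tokens: int, d_model: int) -> np.ndarray:
    """Build a (num_tokens, d_model) matrix of positional vectors:
    even columns hold sines and odd columns hold cosines at
    geometrically spaced frequencies, so each position gets a
    unique pattern.
    """
    positions = np.arange(num_tokens)[:, np.newaxis]        # (num_tokens, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model / 2)
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # one frequency per column pair
    angles = positions * angle_rates                        # (num_tokens, d_model / 2)

    encodings = np.zeros((num_tokens, d_model))
    encodings[:, 0::2] = np.sin(angles)   # even indices: sine
    encodings[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return encodings

# Same toy shapes as before; random values stand in for real embeddings.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(4, 8))

# The positional matrix has exactly the same shape as the embedding
# matrix, so we can add the two element-wise. Each row of the result
# now carries both the token's meaning and its position.
position_aware_matrix = embedding_matrix + sinusoidal_positional_encodings(4, 8)
print(position_aware_matrix.shape)   # (4, 8)
```

Because the two matrices share a shape, the combination is a single element-wise addition: no new rows or columns are introduced, only a position-dependent shift applied to each embedding vector.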