Positional Encodings
Explore how positional encodings add order information to transformer models, which lack inherent sequence awareness. Understand absolute encodings using sinusoidal or learned vectors and relative encodings that capture token distances. Learn the trade-offs between these approaches and why they matter for language understanding tasks and AI interviews.
Interviewers often ask about positional encodings because they reveal whether you understand a core limitation of Transformers: unlike RNNs or CNNs, Transformers have no built-in sense of word order. Without an added signal, they treat sentences as bags of tokens—“John likes cats” and “Cats like John” would be indistinguishable. Positional encodings inject this missing order information.
A strong answer explains why attention alone can’t recover position, and how the two major solutions—absolute and relative positional encodings—work. Interviewers also expect awareness of trade-offs (fixed vs. learned, sinusoidal vs. rotary, distance-based biases) and how these choices affect real models. Even with newer architectures emerging, understanding positional encodings remains essential because most modern LLMs still rely on some form of them.
Why do transformers need positional encoding?
Transformers process all tokens in parallel through self-attention, so they have no inherent information about sequence order. Positional encoding is the extra signal we add to tell the model where each word sits in the sequence. In practice, we assign each position in the input (1st word, 2nd word, etc.) a unique vector and add that to the token’s word embedding. You can think of it like giving each word a timestamp or a coordinate. For example, consider the French sentence “Je suis étudiant” (“I am a student”). Before feeding it into a Transformer encoder, we take the embedding of each word and add a positional vector that encodes that word’s position in the sentence.
The model can use that information in attention and later layers by adding the position vector to each word embedding. In effect, the transformer can now distinguish “the first word” from “the second word,” etc. This turns the transformer from order-agnostic to order-aware. Intuitively, you might imagine the positional encoding as a kind of barcode: it uniquely labels token positions so that downstream layers can factor in where in the sequence each token appeared. This is critical for language tasks. For example, with proper positional encoding, “The cat sat on the mat.” is very different from “The mat sat on the cat,” and the model can learn that difference.
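To make the addition step concrete, here is a minimal NumPy sketch using random arrays as stand-ins for real word embeddings and positional vectors (the shapes are illustrative, not taken from any particular model):

```python
# Minimal sketch: positional information is injected by element-wise addition.
import numpy as np

seq_len, d_model = 3, 8                                # e.g. “Je suis étudiant” -> 3 tokens
token_embeddings = np.random.randn(seq_len, d_model)   # word embeddings (stand-ins)
position_vectors = np.random.randn(seq_len, d_model)   # one unique vector per position

# The transformer's actual input is simply the sum of the two.
model_input = token_embeddings + position_vectors      # shape: (seq_len, d_model)
print(model_input.shape)                               # (3, 8)
```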
Interview trap: An interviewer might ask, “Can’t the Transformer learn positional information implicitly from the data patterns?” and candidates sometimes say, “Yes, attention can figure out word order from context.”
However, that’s incorrect! Self-attention is mathematically permutation-equivariant—if you shuffle the input tokens, you get the same outputs (just shuffled). Without explicit positional encoding, there’s literally no signal in the architecture that distinguishes token order. The model cannot “figure out” position from attention alone because the attention mechanism treats all positions symmetrically by design.
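To make this claim tangible, here is a small NumPy demonstration (an illustration only, not something you would derive in an interview) that a single attention head with random weights and no positional encoding produces the same outputs, merely reordered, when the input tokens are shuffled:

```python
# Demonstration: self-attention without positional encoding is permutation-equivariant.
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ v

seq_len, d = 3, 4                                    # e.g. “John likes cats”
x = rng.normal(size=(seq_len, d))                    # token embeddings, no positions added
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 0, 1])                           # reorder the tokens
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)

# The shuffled input yields the same outputs, just shuffled the same way.
print(np.allclose(out[perm], out_shuffled))          # True
```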
Originally, positional encodings were introduced in the “Attention Is All You Need” paper, which states that “we must inject some information about the relative or absolute position of the tokens in the sequence.” The paper then adds these positional vectors to the input embeddings at the bottom of the model. A strong interview answer would note this key quote: a pure transformer has no recurrence/convolution, so it needs this extra signal. Another way to see why it’s needed: without positional encodings, a self-attention layer can attend equally to all tokens but has no bias to prefer one order over another. In that case, “John likes cats” and “Cats like John” would look the same to the model.
Quick answer for interview: Transformers process all tokens in parallel through self-attention, which is inherently permutation-equivariant—it has no built-in notion of word order. Without positional encoding, “John likes cats” and “Cats like John” would be indistinguishable to the model. Positional encoding solves this by adding a unique vector to each token’s embedding that encodes its position in the sequence. This transforms the Transformer from order-agnostic to order-aware, enabling it to learn position-dependent patterns essential for language understanding.
In summary, positional encoding is a mechanism that enables a transformer to determine the position of each token in the sequence. It’s implemented by adding a specially designed vector to the token’s embedding, so the model can “know” which index (first, second, etc.) the word occupies. We will next look at the two main ways to construct these position vectors.
What is absolute positional encoding, and how does it work?
Absolute positional encodings give each position in the sequence a fixed, unique vector, independent of the words at that position. This is the approach the original Transformer used. Recall that without any added signal, a transformer would treat “Cats chase mice” and “Mice chase cats” identically; absolute encodings remove that ambiguity by stamping every position with its own identifier.
To make transformers order-aware, we inject positional information by adding a position vector to each word’s embedding. This vector can be computed in two ways:
Sinusoidal (fixed): Deterministic, no learned parameters, and supports extrapolation to longer sequences
Learned: Trainable, vectors stored in a lookup table, but limited to the sequence length seen during training
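A hedged sketch of the learned variant follows, written as a PyTorch-style module with an assumed max_len cap of 512 (the class name and sizes are illustrative). It shows the lookup-table idea and why positions beyond the training length are simply not covered:

```python
# Sketch of learned absolute positional embeddings: a trainable lookup table
# with one row per position index. Positions >= max_len have no entry,
# which is the extrapolation limitation mentioned above.
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    def __init__(self, max_len: int = 512, d_model: int = 512):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # one trainable vector per position

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)   # broadcast over the batch

x = torch.randn(2, 10, 512)                  # toy batch of token embeddings
print(LearnedPositionalEncoding()(x).shape)  # torch.Size([2, 10, 512])
```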
Sinusoidal encodings are deterministic and parameter-free. You can think of them as a multi-band radio dial: each dimension of the position vector is a different sine-wave “station,” oscillating at its own frequency.
The original Transformer defines the sinusoidal encoding as:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where:
pos: Position index (e.g., 0 to 511)
i: Dimension index
d: Total embedding size (e.g., 512)
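A short NumPy sketch that builds the full table directly from this formula (the function name and default sizes are illustrative):

```python
# Compute the sinusoidal positional encoding table PE of shape (max_len, d).
import numpy as np

def sinusoidal_positional_encoding(max_len: int = 512, d: int = 512) -> np.ndarray:
    positions = np.arange(max_len)[:, None]           # pos, shape (max_len, 1)
    dims = np.arange(0, d, 2)[None, :]                # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d)  # pos / 10000^(2i/d)

    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # position 0 -> [0. 1. 0. 1.]
```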
How does it work?
Each even-indexed dimension uses sin, and each odd-indexed dimension uses cos. This alternating pattern introduces phase and frequency diversity, like having a different radio wave (frequency band) in each dimension. Each dimension cycles at a different frequency based on 10000^(2i/d). The term ...