Positional Encodings
Learn how positional encodings enable Transformers to understand word order by injecting sequence information using absolute or relative methods.
Interviewers at top AI labs often ask about positional encodings in Transformers because this question probes fundamental understanding of how sequence models work. The Transformer architecture—the basis for models like BERT, GPT, and others—does not use recurrence or convolution. That means it has no built-in notion of word order, unlike an RNN, which processes tokens sequentially, one at a time. Without an extra signal, a Transformer would treat a sentence as a “bag of words.” For example, the sentences “John likes cats” and “Cats like John” would look identical to the model, even though their meanings are very different. Positional encodings are the mechanism for injecting order information into the Transformer.
Interviewers want to know if you understand why this is necessary and how the two main approaches—absolute vs. relative positional encodings—differ. A strong answer will explain that positional encoding tells the Transformer where in the sequence each token is (allowing it to distinguish “first word” from “second word”, etc.), and will show awareness of trade-offs and modern variants of this idea. A clear response should demonstrate that you grasp both the intuition and the mechanics. The interviewer is checking: Can you articulate why a parallel-attention model needs positional information? Can you explain how absolute encodings (like the original sinusoidal scheme) work, vs. relative encodings (which capture token distances)? Can you discuss why one might choose one method over the other in practice?
A top candidate will also mention variations (learned vs. fixed embeddings, rotary or bias-based methods) and relate this to real tasks. Finally, even though new architectures (Mixture-of-Experts, for example) keep changing other parts of the model, their attention layers still have no inherent sense of order, so positional encoding remains a core topic.
What exactly is positional encoding?
Transformers process all tokens in parallel through self-attention, so they have no inherent information about sequence order. Positional encoding is the extra signal we add to tell the model where each word sits in the sequence. In practice, we assign each position in the input (1st word, 2nd word, etc.) a unique vector and add it to the token’s word embedding. You can think of it like giving each word a timestamp or a coordinate. For example, consider the French sentence “Je suis étudiant” (“I am a student”). Before feeding it into a Transformer encoder, we take the embedding of each word and add a positional vector that encodes that word’s position in the sentence.
Because the position vector is added to each word embedding, the model can use that information in attention and in later layers. In effect, the Transformer can now distinguish “the first word” from “the second word,” and so on; it goes from order-agnostic to order-aware. Intuitively, you might imagine the positional encoding as a kind of barcode: it uniquely labels token positions so that downstream layers can factor in where in the sequence each token appeared. This is critical for language tasks. For example, with proper positional encoding, “The cat sat on the mat.” is very different from “The mat sat on the cat,” and the model can learn that difference.
Originally, positional encodings were introduced in the “Attention Is All You Need” paper, which states that “we must inject some information about the relative or absolute position of the tokens in the sequence.” The paper then adds these positional vectors to the input embeddings at the bottom of the model. A strong interview answer would note this key quote: a pure Transformer has no recurrence/convolution, so it needs this extra signal. Another way to see why it’s needed: without positional encodings, a self-attention layer can attend equally to all tokens but has no bias to prefer one order over another. In that case, “John likes cats” and “Cats like John” would look the same to the model.
In summary, positional encoding is a mechanism to tell a Transformer the position of each token in the sequence. It’s implemented by adding a specially-designed vector to the token’s embedding, so the model can “know” which index (first, second, etc.) the word occupies. We will next look at the two main ways to construct these position vectors.
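As a minimal sketch of this mechanism (the tensor sizes and random vectors below are purely illustrative, not values from the paper), the injection is just an element-wise sum of two same-shaped matrices:

```python
import torch

seq_len, d_model = 3, 8  # e.g., a 3-token sentence with 8-dimensional embeddings

word_embeddings = torch.randn(seq_len, d_model)   # "what each token is"
position_vectors = torch.randn(seq_len, d_model)  # "where each token sits" (sinusoidal or learned)

# The encoder input is simply the sum: each row now carries content plus position.
encoder_input = word_embeddings + position_vectors
print(encoder_input.shape)  # torch.Size([3, 8])
```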
What is absolute positional encoding?
Absolute positional encodings give each position in the sequence a fixed, unique vector, independent of the words; this is what the original Transformer used. Recall that without any added signal, a Transformer would treat “Cats chase mice” and “Mice chase cats” identically.
To make Transformers order-aware, we inject positional information by adding a position vector to each word’s embedding. This vector can be computed in two ways (the learned variant is sketched in code after this list; the sinusoidal variant after its formula below):
Sinusoidal (fixed): Deterministic, no learned parameters, and supports extrapolation to longer sequences
Learned: Trainable, vectors stored in a lookup table, but limited to the sequence length seen during training
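As a minimal sketch of the learned variant above (the class name, max_len, and d_model are illustrative choices, not a specific library API), the lookup table is just a trainable embedding indexed by position:

```python
import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Trainable lookup table with one position vector per index up to max_len."""

    def __init__(self, max_len: int = 512, d_model: int = 512):
        super().__init__()
        self.pos_embedding = nn.Embedding(max_len, d_model)  # learned with the rest of the model

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model); seq_len must not exceed max_len
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)  # 0, 1, ..., seq_len - 1
        return token_embeddings + self.pos_embedding(positions)  # broadcasts over the batch dimension
```

Because positions beyond max_len have no row in the table, this variant cannot extrapolate to longer sequences, which is the trade-off noted in the list above.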
Sinusoidal encodings are deterministic and parameter-free. You can think of them as a multi-band radio dial: each dimension of the position vector is a different sine-wave “station.”
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where:
pos: Position index (e.g., 0 to 511)
i: Dimension index
d: Total embedding size (e.g., 512)
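A minimal PyTorch sketch of this formula, using the example sizes above (the function name is an illustrative choice):

```python
import torch

def sinusoidal_positional_encoding(max_len: int = 512, d: int = 512) -> torch.Tensor:
    """Return a (max_len, d) matrix whose row `pos` encodes that position."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(d // 2, dtype=torch.float32).unsqueeze(0)           # (1, d/2) dimension-pair index
    angles = positions / (10000.0 ** (2 * i / d))                        # (max_len, d/2)

    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angles)  # even dimensions use sin
    pe[:, 1::2] = torch.cos(angles)  # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding()
print(pe.shape)  # torch.Size([512, 512])
```

Because nothing here is learned, the same function can generate vectors for positions beyond any length seen during training, which is the extrapolation property noted in the list above.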
How does it work?
Each even-indexed dimension uses sin, and each odd-indexed dimension uses cos. This alternating pattern introduces phase and frequency diversity, like having a different radio wave (frequency band) in each dimension. Each dimension cycles at a different frequency based on i. The term ...