
Transformer Architecture

Explore the fundamental systems that make transformer models work, including positional encoding, masking techniques, layer normalization, and Flash Attention. Understand why these components matter and how design choices impact model performance and training stability in AI and large language model development.

Attention is the core engine of the transformer, but it cannot run alone. Left unmodified, the architecture has no concept of word order, no way to hide future tokens during training, unstable gradients at depth, and a memory footprint that makes long contexts impractical. This lesson covers the four systems that solve those problems: positional encoding, masking, layer normalization, and Flash Attention.

These are not minor implementation details. Each one is an active interview topic, and the answers interviewers are looking for are almost always about why a design choice was made, not just what it is.

The original 2017 transformer used sinusoidal positional encodings, post-layer-norm, and standard attention. Every one of those choices has been revised in modern LLMs. RoPE replaced sinusoidal PE. Pre-norm replaced post-norm. Flash Attention replaced standard attention. Understanding why each replacement happened is more valuable than memorizing the original design.

Why do transformers need positional encoding, and what is RoPE?

Self-attention treats the input as a set, not a sequence. The operation is permutation-equivariant: shuffling the input tokens produces the same outputs, just reordered. This means “the cat chased the dog” and “the dog chased the cat” are indistinguishable to the model without explicit position information. This is not a training limitation that can be overcome with more data. It is a property of the operation itself: without positional encoding, the model has no mechanism to distinguish position 1 from position 10. Positional encoding fixes this by injecting order into the token representations before they enter the attention layers.
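Permutation equivariance is easy to verify directly. The following is a minimal NumPy sketch (not any production implementation; all names and shapes are illustrative): a single self-attention layer with no positional encoding, applied to a shuffled input, produces exactly the shuffled outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                         # embedding dimension (illustrative)
X = rng.normal(size=(5, d))   # 5 token embeddings, no position info

# Random projection weights for Q, K, V
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax over attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

perm = rng.permutation(5)
out = self_attention(X)
out_shuffled = self_attention(X[perm])

# Same output vectors, just reordered to match the input shuffle:
# the layer cannot tell position 1 from position 10.
assert np.allclose(out[perm], out_shuffled)
```

Because the assertion holds for any permutation and any weights, no amount of training can make this layer order-aware on its own; position must be injected explicitly.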

The original transformer used sinusoidal positional encodings: deterministic vectors computed from sine and cosine functions at different frequencies, added to the token embeddings. They are parameter-free and can in principle generalize to sequence lengths longer than those seen during training. GPT-2 and BERT replaced these with learned positional embeddings, a simple trainable lookup table where each position gets its own vector. More expressive, but hard-capped at the maximum training length.
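The sinusoidal scheme is compact enough to show in full. This is a sketch following the formula from the 2017 paper, PE[pos, 2i] = sin(pos / 10000^(2i/d)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); the function name is illustrative.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # Each even column is a sine, each odd column a cosine, at a
    # geometrically decreasing frequency across the feature dimension.
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * freqs)
    pe[:, 1::2] = np.cos(positions * freqs)
    return pe

pe = sinusoidal_pe(128, 64)   # parameter-free: works for any seq_len
```

Note there is no trainable parameter and no hard length cap: the same function can be evaluated at position 5,000 even if training never exceeded 512, which is the extrapolation property learned embeddings give up.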

Both approaches encode absolute position, which creates a problem: what matters in language is usually relative distance, not absolute index. Whether a pronoun refers to a noun 3 positions back or 300 positions back, the relationship is the same type. Absolute encodings conflate position and distance.

Rotary Position Embedding (RoPE), introduced in 2021, is now the standard for modern LLMs. The key insight is to encode position by rotating the Q and K vectors rather than adding anything to the embeddings. Each position i applies a rotation matrix to Q_i and K_i ...
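The rotation can be sketched in a few lines of NumPy. This is an illustrative implementation, not drawn from any particular library (`rope_rotate` is a hypothetical helper name): each consecutive (even, odd) feature pair is treated as a 2D point and rotated by an angle proportional to the token's position, with a different frequency per pair. The payoff is that the dot product of a rotated query and key depends only on their relative offset, not their absolute positions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Rotate each (even, odd) feature pair of x by angle pos * freq_i,
    # where freq_i falls geometrically across pairs (as in sinusoidal PE).
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin   # standard 2D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Relative-position property: rotating Q at position m and K at position n
# yields a dot product that depends only on (m - n).
dot_a = rope_rotate(q, 3) @ rope_rotate(k, 1)    # offset 2
dot_b = rope_rotate(q, 10) @ rope_rotate(k, 8)   # same offset 2
assert np.isclose(dot_a, dot_b)
```

This is exactly the property absolute encodings lack: a pronoun attending to a noun 3 tokens back produces the same attention score whether the pair sits at positions (3, 1) or (10, 8).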