
The Encoder: Understanding Input

Explore how the transformer encoder turns raw input tokens into rich, contextualized representations by leveraging self-attention and multi-head attention mechanisms. Learn how residual connections and feedforward networks enhance training stability and model capacity, enabling encoder-based models to excel at language understanding tasks.

In the previous lesson, we introduced the transformer’s encoder-decoder structure and established that the encoder reads the full input sequence before any output is produced. Now we zoom into the encoder itself to understand exactly how it transforms raw input tokens into the rich, contextualized representations that downstream tasks depend on.

Consider a fundamental limitation of simple word embeddings. A lookup table assigns the same vector to the word “bank” regardless of whether the surrounding sentence discusses a financial institution or the edge of a river. The encoder’s job is to resolve this ambiguity. It takes these static, context-free embeddings and produces context-aware representations where each token’s vector encodes information about every other token in the sequence.
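The limitation is easy to see concretely. The sketch below (toy, untrained 4-dimensional vectors; the vocabulary and values are illustrative assumptions, not from any real model) shows that a static lookup table returns the identical vector for “bank” in both sentences:

```python
import numpy as np

# A static lookup table: one fixed vector per word, regardless of context.
# (Toy 4-dimensional embeddings with random illustrative values.)
rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "loan"]
embedding_table = {word: rng.standard_normal(4) for word in vocab}

financial = ["the", "bank", "loan"]    # "bank" = financial institution
geographic = ["the", "river", "bank"]  # "bank" = edge of a river

vec_financial = embedding_table["bank"]
vec_geographic = embedding_table["bank"]

# Identical vectors: the lookup table cannot tell the two senses apart.
assert np.array_equal(vec_financial, vec_geographic)
```

The encoder's output, by contrast, would assign “bank” a different vector in each sentence, because each token's representation is computed from the whole sequence.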

The encoder achieves this by stacking identical layers. The original transformer uses six such layers, each containing two sub-layers. The first is a multi-head self-attention mechanism, and the second is a position-wise feedforward network, which applies the same weighted-sum-plus-activation computation from earlier lessons independently to each token position. Each sub-layer is wrapped in a residual connection (a shortcut that adds a sub-layer's input directly to its output, allowing gradient signals to bypass transformations and flow more easily during training) followed by layer normalization (a technique that normalizes activations across features for each individual example, stabilizing the distribution of values flowing through the network). These two mechanisms work together to stabilize training across deep stacks of layers.
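The layer structure can be sketched in a few lines of NumPy. This is a minimal illustration, not a faithful implementation: it uses single-head attention for brevity (multi-head attention is covered below), omits the learnable scale and shift in layer normalization, and uses small random weights. The shapes and the residual-plus-normalization wrapping, however, follow the structure just described:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension for each token independently.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single-head self-attention (multi-head omitted for brevity).
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

def feedforward(x, W1, W2):
    # Position-wise FFN: the same weights applied at every token position.
    return np.maximum(0, x @ W1) @ W2  # ReLU activation

def encoder_layer(x, params):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attention(x, *params["attn"]))
    # Sub-layer 2: feedforward network, wrapped in residual + layer norm.
    x = layer_norm(x + feedforward(x, *params["ffn"]))
    return x

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

def make_params():
    return {
        "attn": [rng.standard_normal((d_model, d_model)) * 0.1
                 for _ in range(3)],
        "ffn": [rng.standard_normal((d_model, 4 * d_model)) * 0.1,
                rng.standard_normal((4 * d_model, d_model)) * 0.1],
    }

# Six layers with identical structure but separate weights,
# as in the original transformer.
layers = [make_params() for _ in range(6)]

x = rng.standard_normal((seq_len, d_model))  # stand-in for input embeddings
for params in layers:
    x = encoder_layer(x, params)

print(x.shape)  # one contextualized vector per input token
```

Note that the output has the same shape as the input: each layer refines the per-token representations without changing their dimensionality, which is exactly what makes the residual connections possible.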

Note: BERT, the most prominent encoder-only model, uses this exact architecture to power Google Search’s understanding of queries. When you type a search query, an encoder-based model produces contextualized representations that capture what you actually mean.

The following diagram illustrates the internal structure of a single encoder layer and how six such layers stack together to form the full encoder.

Transformer encoder architecture showing a single layer with self-attention and feedforward sub-layers stabilized by residual connections, stacked six times to form the complete encoder

With this high-level structure in place, let us examine the mechanism that gives the encoder its power: self-attention.

Self-attention inside the encoder

Computing attention scores

Self-attention allows every token in a sequence to directly interact with every other token. Each input token’s embedding is projected into three separate vectors using learned weight matrices. These three vectors are called the ...