The Encoder: Understanding Input
Explore how the transformer encoder transforms raw input tokens into rich, contextualized representations by leveraging self-attention and multi-head attention mechanisms. Learn how residual connections and feedforward networks enhance training stability and model capacity, enabling encoder-based models to excel in understanding language tasks.
In the previous lesson, we introduced the transformer’s encoder-decoder structure and established that the encoder reads the full input sequence before any output is produced. Now we zoom into the encoder itself to understand exactly how it transforms raw input tokens into the rich, contextualized representations that downstream tasks depend on.
Consider a fundamental limitation of simple word embeddings. A lookup table assigns the same vector to the word “bank” regardless of whether the surrounding sentence discusses a financial institution or the edge of a river. The encoder’s job is to resolve this ambiguity. It takes these static, context-free embeddings and produces context-aware representations where each token’s vector encodes information about every other token in the sequence.
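The "bank" problem can be made concrete with a toy lookup table. This is a minimal sketch, not a real embedding model: the vocabulary, the 4-dimensional random vectors, and the `embedding_table` dict are all illustrative stand-ins.

```python
import numpy as np

# A hypothetical static embedding table: one fixed vector per word,
# regardless of context (vectors here are random stand-ins).
rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "loan"]
embedding_table = {word: rng.standard_normal(4) for word in vocab}

sentence_a = ["the", "bank", "loan"]   # financial sense
sentence_b = ["the", "river", "bank"]  # riverbank sense

vec_a = embedding_table["bank"]  # same vector in both sentences
vec_b = embedding_table["bank"]
print(np.array_equal(vec_a, vec_b))  # True: the lookup ignores context
```

The encoder's output, by contrast, would assign "bank" a different vector in each sentence, because each token's representation is computed from the whole sequence.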
The encoder achieves this by stacking identical layers. The original transformer uses six such layers, each containing two sub-layers. The first is a multi-head self-attention mechanism, and the second is a position-wise feedforward network, which applies the same weighted-sum-plus-activation computation from earlier lessons independently to each token position. Each sub-layer is wrapped in a residual connection followed by layer normalization, which keeps training stable as the layers stack.
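The layer structure described above can be sketched in a few lines of numpy. This is a simplified illustration, not a faithful implementation: it uses a single attention head instead of multiple, untrained random weights, and toy dimensions (`d_model=8`, `d_ff=32`), but the flow of each sub-layer (self-attention, then feedforward, each with a residual connection and layer normalization) matches the description.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, w_q, w_k, w_v):
    # Project tokens into queries, keys, and values, then mix the values
    # according to scaled dot-product attention weights.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise FFN: the same computation applied to each token.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sub-layer 2: feedforward, wrapped in residual + layer norm.
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
params = {
    "w_q": rng.standard_normal((d_model, d_model)) * 0.1,
    "w_k": rng.standard_normal((d_model, d_model)) * 0.1,
    "w_v": rng.standard_normal((d_model, d_model)) * 0.1,
    "w1": rng.standard_normal((d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "w2": rng.standard_normal((d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}

tokens = rng.standard_normal((seq_len, d_model))
out = tokens
for _ in range(6):   # six stacked identical layers, as in the original transformer
    out = encoder_layer(out, params)
print(out.shape)     # (5, 8): one contextual vector per input token
```

Note that the output has the same shape as the input: each encoder layer refines the per-token vectors rather than changing their number or size, which is what makes stacking six identical layers possible.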
Note: BERT, the most prominent encoder-only model, uses this exact architecture to power Google Search’s understanding of queries. When you type a search query, an encoder-based model produces contextualized representations that capture what you actually mean.
The following diagram illustrates the internal structure of a single encoder layer and how six such layers stack together to form the full encoder.
With this high-level structure in place, let us examine the mechanism that gives the encoder its power: self-attention.
Self-attention inside the encoder
Computing attention scores
Self-attention allows every token in a sequence to directly interact with every other token. Each input token's embedding is projected into three separate vectors using learned weight matrices. These three vectors are called the query, key, and value vectors.
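The three projections can be sketched directly. In this illustrative snippet the weight matrices are random stand-ins for learned parameters, and the dimensions (`d_model=8`, four tokens) are arbitrary; the point is the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, seq_len = 8, 4
x = rng.standard_normal((seq_len, d_model))   # one embedding per token

# Learned projection matrices (random stand-ins for illustration).
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))
w_v = rng.standard_normal((d_model, d_model))

q, k, v = x @ w_q, x @ w_k, x @ w_v           # query, key, value vectors

# Every token's query is compared against every token's key, giving a
# full seq_len x seq_len matrix of raw attention scores.
scores = q @ k.T / np.sqrt(d_model)
print(scores.shape)  # (4, 4): each token scores every token, itself included
```

The square score matrix is what lets every token interact with every other token in a single step, with no regard for distance in the sequence.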