
The Encoder: Understanding Input

Explore how the transformer encoder turns raw input tokens into rich, contextualized representations by leveraging self-attention and multi-head attention mechanisms. Learn how residual connections and feedforward networks enhance training stability and model capacity, enabling encoder-based models to excel at language understanding tasks.

In the previous lesson, we introduced the transformer’s encoder-decoder structure and established that the encoder reads the full input sequence before any output is produced. Now we zoom into the encoder itself to understand exactly how it transforms raw input tokens into the rich, contextualized representations that downstream tasks depend on.

Consider a fundamental limitation of simple word embeddings. A lookup table assigns the same vector to the word “bank” regardless of whether the surrounding sentence discusses a financial institution or the edge of a river. The encoder’s job is to resolve this ambiguity. It takes these static, context-free embeddings and produces context-aware representations where each token’s vector encodes information about every other token in the sequence.
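The limitation is easy to see concretely. The sketch below (toy, untrained 4-dimensional vectors; the vocabulary and values are illustrative assumptions, not from any real model) shows that a static lookup table returns the identical vector for “bank” in both sentences:

```python
import numpy as np

# A static lookup table: one fixed vector per word, regardless of context.
# (Toy 4-dimensional embeddings with random illustrative values.)
rng = np.random.default_rng(0)
vocab = ["the", "bank", "river", "loan"]
embedding_table = {word: rng.standard_normal(4) for word in vocab}

financial = ["the", "bank", "loan"]    # "bank" = financial institution
geographic = ["the", "river", "bank"]  # "bank" = edge of a river

vec_financial = embedding_table["bank"]
vec_geographic = embedding_table["bank"]

# Identical vectors: the lookup table cannot tell the two senses apart.
assert np.array_equal(vec_financial, vec_geographic)
```

The encoder's output, by contrast, would assign “bank” a different vector in each sentence, because each token's representation is computed from the whole sequence.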

The encoder achieves this by stacking identical layers. The original transformer uses six such layers, each containing two sub-layers. The first is a multi-head self-attention mechanism, and the second is a position-wise feedforward network, which applies the same weighted-sum-plus-activation computation from earlier lessons independently to each token position. Each sub-layer is wrapped in a residual connection (a shortcut that adds a sub-layer's input directly to its output, allowing gradient signals to bypass transformations and flow more easily during training) followed by layer normalization (a technique that normalizes activations across features for each individual example, stabilizing the distribution of values flowing through the network). These two mechanisms work together to stabilize training across deep stacks of layers.
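The layer structure can be sketched in a few lines of NumPy. This is a minimal illustration, not a faithful implementation: it uses single-head attention for brevity (multi-head attention is covered below), omits the learnable scale and shift in layer normalization, and uses small random weights. The shapes and the residual-plus-normalization wrapping, however, follow the structure just described:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension for each token independently.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x, W_q, W_k, W_v):
    # Single-head self-attention (multi-head omitted for brevity).
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v

def feedforward(x, W1, W2):
    # Position-wise FFN: the same weights applied at every token position.
    return np.maximum(0, x @ W1) @ W2  # ReLU activation

def encoder_layer(x, params):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attention(x, *params["attn"]))
    # Sub-layer 2: feedforward network, wrapped in residual + layer norm.
    x = layer_norm(x + feedforward(x, *params["ffn"]))
    return x

rng = np.random.default_rng(0)
d_model, seq_len = 8, 5

def make_params():
    return {
        "attn": [rng.standard_normal((d_model, d_model)) * 0.1
                 for _ in range(3)],
        "ffn": [rng.standard_normal((d_model, 4 * d_model)) * 0.1,
                rng.standard_normal((4 * d_model, d_model)) * 0.1],
    }

# Six layers with identical structure but separate weights,
# as in the original transformer.
layers = [make_params() for _ in range(6)]

x = rng.standard_normal((seq_len, d_model))  # stand-in for input embeddings
for params in layers:
    x = encoder_layer(x, params)

print(x.shape)  # one contextualized vector per input token
```

Note that the output has the same shape as the input: each layer refines the per-token representations without changing their dimensionality, which is exactly what makes the residual connections possible.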

Note: BERT, the most prominent encoder-only model, uses this exact architecture to power Google Search’s understanding of queries. When you type a search query, an encoder-based model produces contextualized representations that capture what you actually mean.

The following diagram illustrates the internal structure of a single encoder layer and how six such layers stack together to form the full encoder.

Transformer encoder architecture showing a single layer with self-attention and feedforward sub-layers stabilized by residual connections, stacked six times to form the complete encoder

With this high-level structure in place, let us examine the mechanism that gives the encoder its power: self-attention.

Self-attention inside the encoder

Computing attention scores

Self-attention allows every token in a sequence to directly interact with every other token. Each input token’s embedding is projected into three separate vectors using learned weight matrices. These three vectors are called the ...