The Transformer's Encoder

Formulate the encoder of a transformer by combining all the building blocks.

Even though multi-head self-attention could be a stand-alone building block, the creators of the transformer add another stack of two linear layers with an activation in between, followed by another skip connection and layer normalization.

Add linear layers to form the encoder

Suppose x is the output of the multi-head self-attention. The component we depict as linear in the diagram will look something like this:

import torch
import torch.nn as nn

dim = 512
dim_linear_block = 1024  # usually a multiple of dim
dropout = 0.1

norm = nn.LayerNorm(dim)
linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
)

out = norm(linear(x) + x)

Dropout helps avoid overfitting. This block is not exactly a linear model: as we saw in the second chapter, it can be called a feedforward neural network or an MLP (multi-layer perceptron). The code illustrates that it is nothing new.

The idea of the linear layers after multi-head self-attention is to project the representation into a higher-dimensional space and then back to the original space. This helps with some stability issues and counters bad initializations.

Finally, this is the transformer’s encoder:
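Putting all the building blocks together, a single encoder layer can be sketched roughly as follows. This is an illustrative sketch, not the chapter's exact implementation: the class name `EncoderLayer` is hypothetical, and PyTorch's built-in `nn.MultiheadAttention` stands in for the multi-head self-attention developed earlier.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative sketch of one transformer encoder layer:
    multi-head self-attention followed by the position-wise
    feedforward block, each wrapped in a skip connection and
    layer normalization (post-norm variant)."""

    def __init__(self, dim=512, heads=8, dim_linear_block=1024, dropout=0.1):
        super().__init__()
        # Built-in attention used as a stand-in for the custom block
        self.attention = nn.MultiheadAttention(dim, heads,
                                               dropout=dropout,
                                               batch_first=True)
        self.norm_1 = nn.LayerNorm(dim)
        self.norm_2 = nn.LayerNorm(dim)
        # Project up to dim_linear_block, then back down to dim
        self.linear = nn.Sequential(
            nn.Linear(dim, dim_linear_block),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_linear_block, dim),
        )

    def forward(self, x):
        attn_out, _ = self.attention(x, x, x)
        x = self.norm_1(attn_out + x)            # skip connection + norm
        return self.norm_2(self.linear(x) + x)   # skip connection + norm

layer = EncoderLayer()
tokens = torch.rand(2, 10, 512)  # (batch, sequence, dim)
out = layer(tokens)
print(out.shape)  # torch.Size([2, 10, 512])
```

Note that the input and output shapes match, which is what allows several such layers to be stacked to form the full encoder.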
