The Transformer's Encoder
Explore the architecture of the transformer's encoder, including its multi-head self-attention, normalization, and feed-forward layers with skip connections. Understand how these components work together to process input sequences and stabilize training for natural language processing.
Even though this could be a stand-alone building block, the creators of the transformer added a stack of two linear layers with an activation in between, followed by another skip connection and renormalization.
Add linear layers to form the encoder
Suppose `y` is the output of the multi-head self-attention. What we depict as linear in the diagram will look something like this:
import torch
import torch.nn as nn

dim = 512
dim_linear_block = 1024  # usually a multiple of dim
dropout = 0.1

norm = nn.LayerNorm(dim)

# Two linear layers with a non-linearity in between.
# The layers after the ReLU complete the truncated snippet,
# following the standard feed-forward layout: project back to
# dim and apply dropout.
linear = nn.Sequential(
    nn.Linear(dim, dim_linear_block),
    nn.ReLU(),
    nn.Dropout(dropout),
    nn.Linear(dim_linear_block, dim),
    nn.Dropout(dropout),
)
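To see how the pieces fit together, here is a minimal sketch of the skip connection and renormalization step, assuming the `norm` and `linear` modules defined above and a batch-first input tensor `y` standing in for the self-attention output:

# y stands in for the multi-head self-attention output,
# with shape [batch, tokens, dim]
y = torch.rand(4, 10, dim)

# feed-forward sub-layer with skip connection and layer normalization
out = norm(linear(y) + y)
print(out.shape)  # torch.Size([4, 10, 512])

Because `linear` maps back to `dim`, the skip connection can be added element-wise, and the layer normalization keeps the activations in a stable range, which helps stabilize training.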