Transformer Architecture: Residuals and Normalization

Learn about residual connections and normalization layers in the transformer architecture.

Another important characteristic of transformer models is the use of residual connections and normalization layers between the individual layers of the model.

Residual connections

Residual connections are formed by adding a given layer’s output to the output of one or more layers further ahead. This creates shortcut connections through the model and provides stronger gradient flow by reducing the chance of the phenomenon known as vanishing gradients. The vanishing gradient problem causes the gradients in the layers closest to the inputs to become very small, so that training in those layers is hindered. Residual connections in deep learning models were popularized by the paper Deep Residual Learning for Image Recognition (He et al., 2015).
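To make the idea concrete, here is a minimal sketch of a residual connection in PyTorch (the framework choice and the `ResidualBlock` name are ours for illustration, not part of the lesson). The wrapper adds the block’s input directly to the sub-layer’s output, forming the shortcut path described above:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative wrapper that adds a residual (shortcut) connection
    around an arbitrary sub-layer: output = x + sublayer(x).

    Because the addition passes gradients through unchanged, the layers
    closest to the inputs still receive a useful gradient signal even
    in deep stacks.
    """

    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut path: the input x skips past the sub-layer and is
        # added to its output.
        return x + self.sublayer(x)

# Usage: wrap a small feed-forward sub-layer with a residual connection.
block = ResidualBlock(nn.Sequential(nn.Linear(16, 16), nn.ReLU()))
x = torch.randn(2, 16)
y = block(x)       # y = x + sublayer(x)
print(y.shape)     # torch.Size([2, 16])
```

Note that the residual addition requires the sub-layer’s output to have the same shape as its input, which is one reason transformer layers keep a fixed model dimension throughout.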
