# Transformers Building Blocks

Learn why we use skip connections and layer normalization inside a transformer.

## Short residual skip connections

Understanding language relies on a wider understanding of the world and on our ability to combine ideas. Humans make extensive use of these top-down influences (our expectations) when combining words in different contexts.

In a very rough manner, skip connections give a transformer a tiny ability to allow the representations of different levels of processing to interact.

By forming multiple paths, we can “pass” the higher-level understanding of the last layers to the previous layers. This allows us to re-modulate how we understand the input. Again, this is the same idea as human top-down understanding, which is nothing more than expectations.
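To make the idea of multiple paths concrete, here is a minimal sketch of a residual (skip) connection in plain Python. It assumes the common formulation in which a sublayer's output is added elementwise to its own input, so the input always has a direct path around the sublayer. The names `residual` and `toy_sublayer` are hypothetical and used only for illustration; in a real transformer the sublayer would be self-attention or a feed-forward block.

```python
def residual(x, sublayer):
    """Return x + sublayer(x), added elementwise.

    x is a list of floats (one token vector); sublayer is any function
    that maps a vector to a vector of the same length.
    """
    fx = sublayer(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or a feed-forward block:
    # it just scales every feature by 0.1.
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
y = residual(x, toy_sublayer)  # input plus the sublayer's contribution

# If the sublayer contributes nothing, the input passes through unchanged.
identity = residual(x, lambda v: [0.0] * len(v))
```

Because the input is carried along unchanged, the sublayer only has to learn a correction to the representation rather than rebuild it from scratch.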

## Layer normalization

Let’s open the Layer Norm black box.

In Layer Normalization (LN), the mean and variance are computed across channels and spatial dims.

In language, each word is a vector. Since we are dealing with vectors, we only have one spatial dimension.

$\mu_{n}=\frac{1}{K} \sum_{k=1}^{K} x_{nk}$

$\sigma_{n}^{2}=\frac{1}{K} \sum_{k=1}^{K}\left(x_{nk}-\mu_{n}\right)^{2}$

$\hat{x}_{nk}=\frac{x_{nk}-\mu_{n}}{\sqrt{\sigma_{n}^{2}+\epsilon}}, \quad \hat{x}_{nk} \in \mathbb{R}$

$\mathrm{LN}_{\gamma, \beta}\left(x_{n}\right)=\gamma \hat{x}_{n}+\beta, \quad x_{n} \in \mathbb{R}^{K},$

where $\gamma$ and $\beta$ are trainable parameters.
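The four equations above can be sketched directly in plain Python. This is a minimal illustration, not a production implementation: it assumes that $\gamma$ and $\beta$ are per-feature vectors of length $K$ applied elementwise, and that $\epsilon = 10^{-5}$. Neither choice is specified in the text. The function name `layer_norm` and the sample values are hypothetical.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply layer normalization to each row x_n of length K.

    x     : list of N token vectors, each a list of K floats
    gamma : list of K scale parameters (assumed per-feature)
    beta  : list of K shift parameters (assumed per-feature)
    eps   : small constant for numerical stability (assumed value)
    """
    out = []
    for row in x:
        k = len(row)
        mu = sum(row) / k                          # mean over the K features
        var = sum((v - mu) ** 2 for v in row) / k  # biased variance over K
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([g * (v - mu) * inv_std + b
                    for v, g, b in zip(row, gamma, beta)])
    return out


# Two "words", each a vector with K = 4 features.
tokens = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 30.0]]
gamma = [1.0] * 4  # identity scale
beta = [0.0] * 4   # zero shift
normed = layer_norm(tokens, gamma, beta)
```

With identity scale and zero shift, each output vector has a mean of approximately zero and a variance of approximately one, regardless of the scale of the input vector. Note that the statistics are computed per token, so they do not depend on the other tokens in the sequence or the batch.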

In a 4D tensor with merged spatial dimensions, we can visualize this with the following figure:
