
Extracting Embeddings From All Encoder Layers of BERT

Explore the process of extracting token embeddings from all encoder layers of the pre-trained BERT model. Learn the differences between embeddings from the final layer and all layers, understand their shapes and uses, and see how concatenating embeddings improves task performance. Discover how to use the transformers library to access these embeddings for practical NLP applications.

We've extracted the embeddings obtained from the final encoder layer of the pre-trained BERT model. Now the question is, should we consider the embeddings obtained only from the final encoder layer (the final hidden state), or should we also consider the embeddings obtained from all the encoder layers (all hidden states)? Let's explore this.
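Before comparing the two options, it helps to see how both are exposed by the transformers library. The following is a minimal sketch, assuming the bert-base-uncased checkpoint and the example sentence 'I love Paris' (both illustrative choices); passing output_hidden_states=True makes the model return the hidden states of every layer alongside the final hidden state:

```python
import torch
from transformers import BertModel, BertTokenizer

# Load the pre-trained model with output_hidden_states=True so that the
# embeddings from every encoder layer are returned, not just the final one.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

inputs = tokenizer('I love Paris', return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Embeddings from the final (twelfth) encoder layer only
last_hidden_state = outputs.last_hidden_state

# Tuple containing the input embedding layer plus all 12 encoder layers
hidden_states = outputs.hidden_states

print(last_hidden_state.shape)  # torch.Size([1, 5, 768]) -> [batch, tokens, hidden]
print(len(hidden_states))       # 13 = input embedding layer + 12 encoder layers
```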

Let's represent the input embedding layer with $h_0$, the first encoder layer (first hidden layer) with $h_1$, the second encoder layer (second hidden layer) with $h_2$, and so on up to the final, twelfth encoder layer, $h_{12}$, as shown in the following figure:

Figure: Pre-trained BERT
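Continuing the sketch above, the indices of the returned hidden_states tuple line up with this numbering: index 0 is the input embedding layer $h_0$, and index 12 is the final encoder layer $h_{12}$, which is the same tensor as last_hidden_state:

```python
# Indices of hidden_states line up with the layer numbering h_0 ... h_12.
h_0 = hidden_states[0]    # input embedding layer
h_1 = hidden_states[1]    # first encoder layer
h_12 = hidden_states[12]  # final (twelfth) encoder layer

print(torch.equal(h_12, last_hidden_state))  # True: h_12 is the final hidden state
print(h_0.shape, h_1.shape, h_12.shape)      # each is [batch, tokens, 768]
```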

Instead of taking the embeddings (representations) only from the final encoder layer, the researchers of BERT experimented with taking embeddings from different encoder layers.
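As an illustration of combining layers, one option is to concatenate the token embeddings from several encoder layers. The sketch below continues the code above and picks the last four layers purely as an example, giving a 3,072-dimensional representation per token:

```python
# Hedged example: concatenate the token embeddings from the last four encoder
# layers (h_9 ... h_12) along the hidden dimension.
last_four_layers = hidden_states[-4:]
concatenated = torch.cat(last_four_layers, dim=-1)

print(concatenated.shape)  # [batch, tokens, 4 * 768] = [1, 5, 3072]
```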

For ...