
Understanding BERT

Explore the architecture and functionality of BERT, an encoder-only transformer model used for natural language processing tasks. Understand how BERT uses special tokens and embeddings to handle sequence classification, token classification, question answering, and multiple-choice tasks. Discover BERT's pretraining with masked language modeling and next sentence prediction, and how it enables strong language understanding for downstream applications.

Bidirectional Encoder Representations from Transformers (BERT) is one of the many transformer models that have come to light over the past few years.

BERT was introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. (https://arxiv.org/pdf/1810.04805.pdf). Transformer models are divided into two main factions:

  • Encoder-based models

  • Decoder-based (autoregressive) models

In other words, either the encoder or the decoder part of the transformer provides the foundation for these models, rather than using both the encoder and the decoder together. The main difference between the two is how attention is used: encoder-based models use bidirectional attention, whereas decoder-based models use autoregressive (that is, left-to-right) attention.
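To make this distinction concrete, here is a minimal sketch (in PyTorch, not part of the original text) contrasting the two attention masks. A value of 1 means a position is allowed to attend to another position:

```python
import torch

seq_len = 5

# Bidirectional (encoder-style) attention: every token may attend to every
# other token in the sequence, so no positions are masked out.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Autoregressive (decoder-style) attention: a token may attend only to itself
# and earlier positions, so everything to its right is masked out.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)
```

Because BERT uses the bidirectional mask, each token's representation can draw on context from both its left and its right.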

BERT is an encoder-based transformer model. It takes an input sequence (a collection of tokens) and produces an encoded output sequence. The figure below depicts the high-level architecture of BERT:

The high-level architecture of BERT

It takes a set of input tokens and produces a sequence of hidden representations generated using several hidden layers.
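As an illustration (not part of the original text), the following sketch uses the Hugging Face transformers library to obtain these hidden representations from a pretrained BERT checkpoint; the bert-base-uncased model is assumed here, but any BERT variant behaves the same way:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pretrained BERT encoder and its matching tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Convert an input sentence into the token IDs BERT expects.
inputs = tokenizer("BERT encodes this sentence.", return_tensors="pt")

# Run the encoder without tracking gradients (inference only).
with torch.no_grad():
    outputs = model(**inputs)

# One hidden representation per input token:
# shape is (batch_size, sequence_length, hidden_size), with hidden_size = 768
# for the base model.
hidden_states = outputs.last_hidden_state
print(hidden_states.shape)
```

Each row of `last_hidden_state` corresponds to one input token, and these per-token vectors are what downstream task heads consume.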

Now, let’s discuss a few details pertinent to BERT, such as inputs consumed by BERT and the tasks it’s designed to solve. ...