Understanding BERT

Let's learn about BERT.

We'll now explore the most influential and commonly used Transformer model: BERT. BERT was introduced in Google's research paper: Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.

What does BERT do exactly? To understand what BERT outputs, let's dissect the name:

  • Bidirectional: Training on the text data is bidirectional, which means each input sentence is processed from left to right as well as from right to left, so every word is represented with context from both sides.

  • Encoder: BERT uses the encoder part of the transformer architecture, which converts the input sentence into vectors.

  • Representations: A representation is a word vector.

  • Transformers: The architecture is transformer-based.

BERT is essentially a trained transformer encoder stack. The input to BERT is a sentence, and the output is a sequence of word vectors. These word vectors are contextual, which means that the vector assigned to a word depends on the sentence the word appears in. In short, BERT outputs contextual word representations.
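If you want to see these contextual vectors for yourself, here is a minimal sketch, assuming the Hugging Face transformers library and PyTorch are installed (neither is required by this lesson, they are just one convenient way to try a pretrained BERT encoder): it feeds one sentence through BERT and prints the shape of the output, one vector per input token.

```python
# A minimal sketch of getting contextual word vectors from BERT.
# Assumes the Hugging Face `transformers` library and PyTorch are installed;
# this is one convenient way to run a pretrained BERT encoder, not the only one.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentence = "I deposited the check at the bank."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, including the special [CLS] and [SEP] tokens.
token_vectors = outputs.last_hidden_state
print(token_vectors.shape)  # (1, number_of_tokens, 768) for bert-base-uncased
```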

We have already seen a number of issues that transformers aim to solve. Another problem that transformers address concerns word vectors. Earlier, we saw that word vectors are context-free: the vector for a word is always the same, independent of the sentence it is used in. The following diagram illustrates this problem:

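To make the context-free problem concrete, here is a small sketch, assuming spaCy with the en_core_web_md model is installed (one of many libraries that ship static word vectors): the same word receives an identical vector in two very different sentences.

```python
# A minimal sketch of the context-free problem: a static word vector is the
# same no matter which sentence the word appears in.
# Assumes spaCy and its en_core_web_md model are installed; any library with
# static word vectors behaves the same way.
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("I deposited the check at the bank.")
doc2 = nlp("We had a picnic on the bank of the river.")

bank1 = next(token for token in doc1 if token.text == "bank")
bank2 = next(token for token in doc2 if token.text == "bank")

# The two "bank" tokens get exactly the same vector, even though the word
# refers to a financial institution in one sentence and a riverside in the other.
print((bank1.vector == bank2.vector).all())  # True
```

Running the BERT sketch from earlier on these two sentences instead would produce two different vectors for "bank", which is exactly the contextual behavior that BERT adds.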