Use Case: Implementing BERT

To use a pretrained transformer model from the Hugging Face repository, we need three components:

  • Tokenizer: Responsible for splitting a piece of text (such as a sentence) into smaller units called tokens.

  • Config: Contains the configuration of the model (for example, the hidden dimension and the number of layers).

  • Model: Takes in the tokens, looks up their embeddings, and produces the final outputs from the provided inputs.

Because we’re using the pretrained model as is, we could ignore the config. However, to show all aspects of the process, we’ll use the configuration nevertheless.
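To make the config and model components concrete, here is a minimal sketch of downloading both from the Hugging Face repository. The bert-base-uncased checkpoint and the BertConfig/BertModel classes are assumptions for illustration; any BERT checkpoint hosted on the Hub works the same way.

```python
from transformers import BertConfig, BertModel

# Checkpoint name is an assumption for illustration; any BERT
# checkpoint on the Hugging Face Hub behaves the same way.
checkpoint = "bert-base-uncased"

# Download the configuration describing the architecture
# (hidden size, number of layers, number of attention heads, etc.).
config = BertConfig.from_pretrained(checkpoint)
print(config.hidden_size, config.num_hidden_layers)

# Download the pretrained weights and build the model from the config.
model = BertModel.from_pretrained(checkpoint, config=config)
```

Passing the config explicitly is redundant when using the pretrained model as is, but it lets us inspect or tweak the architecture settings before the weights are loaded.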

Implementing and using the tokenizer

First, let’s download the tokenizer. We can do this with the transformers library by calling the from_pretrained() function provided by the PreTrainedTokenizerFast base class:
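The snippet below is a minimal sketch of this step; the bert-base-uncased checkpoint is an assumption for illustration. BertTokenizerFast is a subclass of PreTrainedTokenizerFast, so from_pretrained() works the same way on either class.

```python
from transformers import BertTokenizerFast

# Checkpoint name is an assumption for illustration.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Tokenizing a sentence splits it into tokens and maps each token
# to its ID in BERT's vocabulary.
encoded = tokenizer("Transformers are powerful models.")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```

Calling the tokenizer on a string returns the token IDs along with the attention mask and token type IDs that the model expects as inputs.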
