Fill Mask
Explore the fill-mask technique in NLP using Hugging Face pipelines to predict missing words within sentences. Understand how masked language modeling works with models like BERT and DeBERTa, see its applications and domain-specific adaptations, and learn best practices for effective usage in Python.
We'll cover the following...
- Introduction to Masked Language Modeling (MLM)
- How the fill-mask pipeline works
- Using the fill-mask pipeline in Python
- Understanding mask tokens
- Exploring domain-specific models
- Multilingual models
- Advanced: Top-k predictions and probabilities
- Datasets used for MLM pretraining
- Common mistakes and pitfalls
- Executing fill-mask examples
- Summary
Masked Language Modeling (MLM), also known as fill-mask, is a foundational technique in modern NLP.
It enables a model to “guess” a missing token in a sequence using contextual clues. MLM is primarily used during the pre-training phase of models like BERT to learn language context in a bidirectional manner. This capability, however, translates directly into end-user applications such as smart text completion and grammar checking in writing tools; data cleaning by filling in missing or corrupted fields; and improving semantic search by better understanding the context of incomplete queries.
Introduction to Masked Language Modeling (MLM)
Masked Language Modeling (MLM) is one of the fundamental training tasks behind modern transformer-based language models. Instead of predicting the next token (like GPT models do), MLM models learn by guessing missing tokens inside a sentence.
During training, some tokens in a text are randomly replaced by a special mask token, usually [MASK] or <mask> depending on the tokenizer. The model then tries to reconstruct the original text.
Example training input: "The capital of France is [MASK]."
The model is never shown the missing token. It must infer it based on the surrounding context.
This simple idea evolved into a breakthrough technique because it forces the model to understand the relationships between words, grammatical structure, and world knowledge, rather than just memorizing text.
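To make the training setup concrete, here is a minimal sketch (not code from this lesson) of how MLM-style random masking can be reproduced with Hugging Face's DataCollatorForLanguageModeling; the bert-base-uncased checkpoint and the 15% masking rate are illustrative assumptions.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustrative checkpoint; any tokenizer with a mask token works the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The standard MLM collator: it randomly selects ~15% of tokens and (mostly)
# replaces them with [MASK], keeping the original ids as training labels.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("The capital of France is Paris.")
batch = collator([encoding])

# input_ids now contain [MASK] at the selected positions; labels hold the
# original ids there and -100 everywhere else. Masking is random, so a
# short sentence may occasionally come back unmasked; run it a few times.
print(tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])
```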
Fun fact: BERT was the first major model to use MLM and it changed NLP forever. Before 2018, most NLP benchmarks were dominated by sequential RNN/LSTM models. BERT crushed them.
How the fill-mask pipeline works
When we use Hugging Face’s fill-mask pipeline, we simulate what a masked model does in training:
- Provide a text with exactly one mask token (<mask> or [MASK]).
- The model predicts the most likely replacements.
- Each candidate has a probability score.
It is essentially a single-step MLM inference. If you supply a sentence without a mask, the model fails or produces meaningless results. If you use multiple masks, most base models won’t know how to handle it.
Rule of thumb: MLM = one mask per sentence during inference.
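A minimal sketch of that single-step inference, assuming the distilroberta-base checkpoint (which expects the <mask> token), shows how each candidate comes back with a probability score:

```python
from transformers import pipeline

# Illustrative checkpoint; distilroberta-base expects the <mask> token.
fill = pipeline("fill-mask", model="distilroberta-base")

# Exactly one mask per sentence, matching the rule of thumb above.
predictions = fill("The capital of France is <mask>.")

for p in predictions:
    # Each candidate has a probability score and the completed sentence.
    print(f"{p['token_str']!r}: score={p['score']:.3f} -> {p['sequence']}")

# Passing a sentence without a mask token makes the pipeline raise an error.
```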
Using the fill-mask pipeline in Python
Before diving into advanced concepts, let’s start with a clean example using a widely adopted and well-performing model: DeBERTa v3. We will ask the model to complete a factual sentence. ...
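The exact checkpoint used in the lesson is not shown in this excerpt; as a hedged sketch, the call pattern might look like the following, with microsoft/deberta-v3-base as an assumed checkpoint name. Note that DeBERTa tokenizers use [MASK] rather than <mask>.

```python
from transformers import pipeline

# Assumed checkpoint name for illustration; substitute the exact DeBERTa v3
# checkpoint the lesson specifies. DeBERTa tokenizers use [MASK].
fill = pipeline("fill-mask", model="microsoft/deberta-v3-base")

# Complete a factual sentence with a single mask token.
results = fill("The capital of France is [MASK].")

for r in results:
    print(f"{r['token_str']}: {r['score']:.4f}")
```

The pipeline reads the expected mask token from the model's tokenizer, so the token you write in the prompt must match the one that tokenizer defines.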