VideoBERT Model

Explore how VideoBERT integrates video and language representations by pre-training on instructional videos using masked token prediction and linguistic-visual alignment tasks. Understand the process of extracting linguistic and visual tokens, the combined training objectives, and how to fine-tune VideoBERT for downstream applications such as video captioning and frame prediction.

We'll cover the following...

Pre-training a VideoBERT model
Cloze task
- Example: Cooking video
- Video for Training the model
Linguistic-visual alignment
The final pre-training objective

Now we'll learn about yet another interesting variant of BERT called VideoBERT. As the name suggests, along with learning the representation of language, VideoBERT also learns the representation of video. It is among the first models to learn the representations of video and language jointly.

Just as we fine-tuned a pre-trained BERT model for downstream tasks, we can fine-tune a pre-trained VideoBERT model for many interesting downstream tasks, such as image caption generation, video captioning, and predicting the next frames of a video.

Let's explore how exactly the VideoBERT model is pre-trained using the cloze task and linguistic-visual alignment.

Cloze task

First, let's see how VideoBERT is pre-trained using the cloze task. To pre-train VideoBERT, we use instructional videos such as cooking videos. But why instructional videos? Why can't we use any random videos? Let's understand this with an example.

Example: Cooking video

Consider a video where someone is teaching us how to cook. Say the speaker says, 'Cut lemon into slices.' As we hear the speaker say 'cut lemon into slices', they also show us visually how they are cutting the lemon into slices, right? This is shown in the ...
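
To make the cloze idea concrete, here is a minimal sketch of masking tokens in a joint sequence built from the spoken words and the visual tokens of the same clip. It is not VideoBERT's actual pipeline: the helper names `build_sequence` and `mask_tokens`, the `[>]` separator between the two modalities, the masking probability, and the visual token names such as `v_17` are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"
# Special tokens are never masked. The "[>]" separator is an assumption here.
SPECIAL = {"[CLS]", "[>]", "[SEP]"}


def build_sequence(words, visual_tokens):
    """Join linguistic and visual tokens into one input sequence."""
    return ["[CLS]", *words, "[>]", *visual_tokens, "[SEP]"]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of non-special tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets


# Words spoken in the clip, and hypothetical visual tokens for its frames.
words = ["cut", "lemon", "into", "slices"]
visual = ["v_17", "v_4", "v_92"]

sequence = build_sequence(words, visual)
masked, targets = mask_tokens(sequence, mask_prob=0.3, seed=1)

print(masked)   # the sequence with some tokens replaced by [MASK]
print(targets)  # position -> original token to be predicted
```

Because the words and the visual tokens describe the same action, a masked word can in principle be recovered from the visual tokens and vice versa. This is why instructional videos, where speech and visuals match, suit this kind of pre-training.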
