
Pre-training Dataset and Applications of VideoBERT

Explore how VideoBERT is pre-trained on a large dataset of instructional YouTube videos and understand its key applications. Learn how VideoBERT predicts future visual tokens, generates videos from text inputs, and captions videos effectively, enhancing video and language representation learning.

Data source and preprocessing

For VideoBERT to learn rich language and video representations, we need a large number of videos. We don't use random videos for pre-training; instead, we use instructional videos. How do we obtain instructional videos? The researchers built their dataset from instructional videos on YouTube. Using the YouTube video annotation system, they filtered for videos related to cooking, and from these they kept only videos shorter than 15 minutes. In ...
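The filtering step above can be sketched in code. This is a minimal illustration only: the metadata fields (`video_id`, `duration_seconds`, `topic_labels`) and the sample records are hypothetical, since the actual YouTube annotation system and its label format are not public.

```python
from dataclasses import dataclass, field

# Hypothetical video metadata record; the real annotation
# system's schema is an assumption here.
@dataclass
class VideoMeta:
    video_id: str
    duration_seconds: int
    topic_labels: list = field(default_factory=list)

MAX_DURATION = 15 * 60  # keep only videos under 15 minutes

def keep_for_pretraining(video: VideoMeta) -> bool:
    """Keep cooking-related videos under the duration cap."""
    return ("cooking" in video.topic_labels
            and video.duration_seconds < MAX_DURATION)

# Illustrative examples, not real dataset entries.
videos = [
    VideoMeta("a1", 10 * 60, ["cooking"]),   # kept
    VideoMeta("b2", 20 * 60, ["cooking"]),   # too long, filtered out
    VideoMeta("c3", 8 * 60, ["travel"]),     # not cooking, filtered out
]

dataset = [v for v in videos if keep_for_pretraining(v)]
```

The two conditions (topic label and duration cut-off) mirror the two filtering stages the researchers describe: topic filtering first, then a length threshold.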