11:[["$","$La0",null,{"props":{"lessonContent":{"components":[{"type":"Quiz","mode":"edit","content":{"questions":[{"questionText":"What’s the purpose of positional embeddings in the Video Transformer Network (VTN) architecture?","questionOptions":[{"text":"To represent the time frames of each video frame","id":"f1JgVS3WIu58qDlJ1jVK0","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

To represent the time frames of each video frame

\n"},{"text":"To provide spatial information to the transformer encoder","id":"Rua7f71E9J4PjzvzKnkDW","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

To provide spatial information to the transformer encoder

\n"},{"text":"To facilitate attention across spatial dimensions","id":"5xbHvZ3OGhfsCVqc-CWL5","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

To facilitate attention across spatial dimensions

\n"},{"text":"To introduce a time dimension and enable the modeling of temporal relations","id":"jHAM76y5C7Czh93EEfIVt","correct":true,"explanation":{"mdText":"Positional embeddings in the VTN architecture encode the temporal position of each video frame. They are combined with the 2D embeddings extracted by the convolutional neural network (CNN) backbone, introducing a time dimension $(T, H, W, C)$ to the data. This enables the transformer encoder to model both the spatial and the temporal relations that are crucial for video analysis tasks.","mdHtml":"$a1"},"mdHtml":"

To introduce a time dimension and enable the modeling of temporal relations

\n"}],"id":"0_question_0","questionTextHtml":"

What’s the purpose of positional embeddings in the Video Transformer Network (VTN) architecture?

\n"},{"id":"WS82UjIDfgJpqaICHWwhC","questionText":"How does the VTN architecture incorporate temporal information for video classification?","questionOptions":[{"text":"By utilizing spatial embeddings","id":"Qc--49VeJLkSMzZ_TGMGM","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

By utilizing spatial embeddings

\n"},{"text":"By processing each frame with a recurrent neural network","id":"WCzFkBvSCbpKDU4usUS3C","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

By processing each frame with a recurrent neural network

\n"},{"text":"By combining 2D embeddings with positional embeddings","id":"CqaSrW0RvANAJemVHwlPy","correct":true,"explanation":{"mdText":"The VTN architecture processes each video frame with a convolutional neural network (CNN) backbone, extracting 2D embeddings. These embeddings are then combined with positional embeddings representing the time frames. This combination enables the model to capture both spatial and temporal relations, crucial for video classification.","mdHtml":"

The VTN architecture processes each video frame with a convolutional neural network (CNN) backbone, extracting 2D embeddings. These embeddings are then combined with positional embeddings representing the time frames. This combination enables the model to capture both spatial and temporal relations, crucial for video classification.

\n"},"mdHtml":"

By combining 2D embeddings with positional embeddings

\n"},{"text":"By performing direct classification without temporal consideration","id":"C6zdfWhAXrDydMYb4nJ71","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

By performing direct classification without temporal consideration

\n"}],"multipleAnswers":false,"questionTextHtml":"

How does the VTN architecture incorporate temporal information for video classification?

\n"},{"id":"1R-5NrZCdzJl_XzK165g8","questionText":"What does the **CLS** token represent in the context of the Video Transformer Network (VTN) architecture?","questionOptions":[{"text":"It’s a token reserved for natural language processing.","id":"kSewoRNz9c85GN0JdSTD6","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

It’s a token reserved for natural language processing.

\n"},{"text":"It’s a token added to represent the global context of the video.","id":"ylYQvJDv3zren4PC-jU1u","correct":true,"explanation":{"mdText":"In the Video Transformer Network (VTN) architecture, the **CLS** token is a special token added to the sequence of frame features. Through attention, it aggregates information from all frames in the video, allowing the model to capture the video’s global context. This representation is then used for classification.","mdHtml":"

In the Video Transformer Network (VTN) architecture, the CLS token is a special token added to the sequence of frame features. Through attention, it aggregates information from all frames in the video, allowing the model to capture the video’s global context. This representation is then used for classification.

\n"},"mdHtml":"

It’s a token added to represent the global context of the video.

\n"},{"text":"It’s a token used for convolutional neural network operations.","id":"kSPn5tIUx0bk_oKHxvUGm","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

It’s a token used for convolutional neural network operations.

\n"},{"text":"It’s a token indicating the start of a video sequence.","id":"lczaEaeuQpb9uJrlcVvz-","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

It’s a token indicating the start of a video sequence.

\n"}],"multipleAnswers":false,"questionTextHtml":"

What does the CLS token represent in the context of the Video Transformer Network (VTN) architecture?

\n"},{"id":"_TC9CC4JzWG0T-l8KxuoL","questionText":"In the simulated code example, what does the \"temporal representation\" represent?","questionOptions":[{"text":"Binary features of video frames","id":"1tPsDN_isIrjiMa24w6q2","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

Binary features of video frames

\n"},{"text":"Output of the spatio-temporal transformer","id":"TZW_NMgDVWUIMDxM3RCmk","correct":true,"explanation":{"mdText":"In the simulated code example, the \"temporal representation\" is the output of the spatio-temporal transformer mechanism. It’s obtained by summing the features over time for each frame, providing a simplified representation of the model's output for recognizing actions.","mdHtml":"

In the simulated code example, the “temporal representation” is the output of the spatio-temporal transformer mechanism. It’s obtained by summing the features over time for each frame, providing a simplified representation of the model’s output for recognizing actions.

\n"},"mdHtml":"

Output of the spatio-temporal transformer

\n"},{"text":"Number of frames in the video","id":"eTEyJrY9vdlwt2LjwebYL","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

Number of frames in the video

\n"},{"text":"Positional embeddings in the transformer","id":"BbwY_llvoeuGjbbF6Rou7","correct":false,"explanation":{"mdText":"","mdHtml":""},"mdHtml":"

Positional embeddings in the transformer

\n"}],"multipleAnswers":false,"questionTextHtml":"

In the simulated code example, what does the “temporal representation” represent?

\n"}],"comp_id":"I5lRgcICZ8rwUJTQDzmwq"},"iteration":0,"hash":0,"children":[{"text":""}],"status":"normal","contentID":"OK9Us6-v8xOFOtOZz-x_B","saveVersion":5},{"type":"SlateHTML","content":{"html":" ...","comp_id":"WeJ40RcPV9B7-OXSzHDKk"},"hash":1,"iteration":0}]},"isPreviewLesson":false,"pageType":"collection_lesson","aiCoachVideoUrl":"https://youtu.be/kgl8y9J3O6c","collectionDetailsSSR":{"title":"Transformers for Computer Vision Applications","summary":"This is a comprehensive course on vision transformers and their use cases in computer vision. You’ll begin by exploring the rise of transformers and attention mechanisms and their role in deep neural networks. \nYou’ll gain insights into self-attention mechanisms, multi-head attention, and the pros and cons of transformers, building a strong foundation. Next, you’ll discover how transformers reshape image analysis. By comparing self-attention with convolutional encoders and understanding spatial vs. channel vs. temporal attention, you’ll grasp the nuances of applying transformer architectures to visual data. \n\nThe course also explores spatiotemporal transformers, bridging the gap between static images and dynamic data. 
After completing this course, you’ll have the knowledge and skills to leverage transformer networks across diverse applications in deep learning and artificial intelligence.","details":"$a4","clos":["An understanding of transformers and attention mechanisms","Hands-on implementation of computer vision techniques with transformer models","The ability to apply transfer learning for image classification","A strong grasp of object detection and segmentation using transformers"],"toc":{"categories":[{"id":"hqqoabmmu","title":"Introduction","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4506676170063872,"id":4506676170063872,"title":"Introduction to the Course","is_preview":true,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"introduction-to-the-course"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Get familiar with transformers in computer vision, covering key concepts and architectures."},{"id":"r7yp66shq","title":"Overview of Transformer Networks","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5211288875696128,"id":5211288875696128,"title":"Introduction to Transformers","is_preview":true,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"introduction-to-transformers"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6159840717832192,"id":6159840717832192,"title":"The Rise of 
Transformers","is_preview":true,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"the-rise-of-transformers"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6296582315835392,"id":6296582315835392,"title":"Inductive Bias in DNNs","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"inductive-bias-in-dnns"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6421025369358336,"id":6421025369358336,"title":"Attention: General Deep Learning Idea","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"attention-general-deep-learning-idea"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5234721103282176,"id":5234721103282176,"title":"Attention in NLP","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"attention-in-nlp"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6514761212362752,"id":6514761212362752,"title":"Is Attention All We Need?","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"is-attention-all-we-need"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6016449114144768,"id":6016449114144768,"title":"Quiz: Attention and Inductive 
Bias","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-attention-and-inductive-bias"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4938705974067200,"id":4938705974067200,"title":"Self-Attention Mechanism","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"self-attention-mechanism"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4624813131563008,"id":4624813131563008,"title":"Self-Attention Matrix Equations","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"self-attention-matrix-equations"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5469206174498816,"id":5469206174498816,"title":"Multihead Attention","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"multihead-attention"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6371961424576512,"id":6371961424576512,"title":"Encoder-Decoder Attention","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"encoder-decoder-attention"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6332454513934336,"id":6332454513934336,"title":"Transformers Pros and 
Cons","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"transformers-pros-and-cons"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6165330994659328,"id":6165330994659328,"title":"Unsupervised and Self-Supervised Pretraining","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"unsupervised-and-self-supervised-pretraining"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4862119673724928,"id":4862119673724928,"title":"Quiz: Transformers and Multihead Attention","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-transformers-and-multihead-attention"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Grasp the fundamentals of transformer networks, attention mechanisms, and their impact on deep learning."},{"id":5279278489010176,"title":"Neural Machine Translation with a Transformer and Keras","pages":[],"editMode":false,"type":"COLLECTION_PROJECT","slug":"neural-machine-translation-with-a-transformer-and-keras"},{"id":"xjkxyubkj","title":"Transformers in Computer Vision","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6314702933852160,"id":6314702933852160,"title":"Introduction to Transformers in Computer 
Vision","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"introduction-to-transformers-in-computer-vision"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5845856865222656,"id":5845856865222656,"title":"Encoder-Decoder Design Pattern","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"encoder-decoder-design-pattern"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5081521756831744,"id":5081521756831744,"title":"Convolutional Encoders","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"convolutional-encoders"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4964586541023232,"id":4964586541023232,"title":"Self-Attention vs. Convolution","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"self-attention-vs-convolution"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6205981189079040,"id":6205981189079040,"title":"Quiz: Encoder-Decoder Architecture and Attention Mechanism in Computer Vision","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-encoder-decoder-architecture-and-attention-mechanism-in-computer-vision"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6237284336402432,"id":6237284336402432,"title":"Spatial vs. Channel vs. 
Temporal Attention","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"spatial-vs-channel-vs-temporal-attention"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5577204024737792,"id":5577204024737792,"title":"Local vs. Global Attention","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"local-vs-global-attention"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5559101930864640,"id":5559101930864640,"title":"Pros and Cons of Attention in CV","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"pros-and-cons-of-attention-in-cv"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5962224191537152,"id":5962224191537152,"title":"Quiz: Attention in Computer Vision","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-attention-in-computer-vision"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Break apart the application of transformers, attention mechanisms, and the encoder-decoder pattern in computer vision."},{"page_id":5569258708926464,"id":6068571806760960,"title":"Vision Transformer for Image 
Classification","pages":[],"editMode":false,"type":"PATH_EXTERNAL_PROJECT","author_id":6586453712175104,"collection_id":4556568393809920,"is_required":false,"detail_id":"project_6586453712175104_4556568393809920_5569258708926464","cover_image_serving_url":null,"collection_read_time":0,"page_count":0,"brief_summary":null,"course_url_slug":null,"assessments_keys":[],"projects_keys":[],"optional_lessons":[],"time_limit":null},{"id":"rqzumptst","title":"Transformers in Image Classification","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4740696550146048,"id":4740696550146048,"title":"Image Classification with Vision Transformer (ViT and DeiT)","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"image-classification-with-vision-transformer-vit-and-deit"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5878296753209344,"id":5878296753209344,"title":"Shifter Window (Swin) Transformers","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"shifter-window-swin-transformers"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5189798379782144,"id":5189798379782144,"title":"Quiz: Transformers in Image Classification","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-transformers-in-image-classification"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Grasp the fundamentals of ViT, DeiT, and Swin Transformers in image classification."},{"id":6486290880790528,"title":"Fine-Tuning Vision Transformers for Image 
Classification","pages":[],"editMode":false,"type":"COLLECTION_PROJECT","slug":"fine-tuning-vision-transformers-for-image-classification"},{"id":"31bx2lvd2","title":"Transformers in Object Detection","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":4618187708301312,"id":4618187708301312,"title":"Object Detection Methods Review","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"object-detection-methods-review"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6717336169742336,"id":6717336169742336,"title":"DEtection TRansformers (DETR)","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"detection-transformers-detr"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5559032372920320,"id":5559032372920320,"title":"Quiz: Transformers in Object Detection","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-transformers-in-object-detection"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Take a closer look at object detection methods, from traditional approaches to DEtection TRansformers (DETR)."},{"id":"nsv25tkgr","title":"Transformers in Semantic Segmentation","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6251391324258304,"id":6251391324258304,"title":"Image Segmentation Using 
ConvNets","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"image-segmentation-using-convnets"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5949525168226304,"id":5949525168226304,"title":"Image Segmentation Using Transformers","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"image-segmentation-using-transformers"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5283825330552832,"id":5283825330552832,"title":"Quiz: Transformers in Semantic Segmentation","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-transformers-in-semantic-segmentation"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Focus on innovative methods using ConvNets and transformers for semantic image segmentation."},{"id":"lm2ibepo2","title":"Spatio-Temporal Transformers","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":6428178922602496,"id":6428178922602496,"title":"Spatio-Temporal Transformers","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"spatio-temporal-transformers"},{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5894575317188608,"id":5894575317188608,"title":"Quiz: Spatio-Temporal 
Transformers","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"quiz-spatio-temporal-transformers"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Build on the versatility of spatio-temporal transformers for advanced video analysis tasks."},{"id":6730443839504384,"title":"Object Detection with Vision Transformers","pages":[],"editMode":false,"type":"COLLECTION_PROJECT","slug":"object-detection-with-vision-transformers"},{"id":"0ourbm52n","title":"Wrap Up","pages":[{"author_id":6586453712175104,"collection_id":6479851841912832,"page_id":5072275044564992,"id":5072275044564992,"title":"Conclusion","is_preview":false,"parentIndex":"","editMode":false,"is_recovered":false,"type":"collection_lesson","can_edit":false,"is_standalone_module":false,"is_cloned":false,"brief_summary":"","slug":"conclusion"}],"editMode":false,"type":"COLLECTION_CATEGORY","summary":"Step through key concepts of transformers in computer vision and their practical applications."}]},"page_titles":{"4506676170063872":"Introduction to the Course","6159840717832192":"The Rise of Transformers","6296582315835392":"Inductive Bias in DNNs","6421025369358336":"Attention: General Deep Learning Idea","5234721103282176":"Attention in NLP","6514761212362752":"Is Attention All We Need?","4938705974067200":"Self-Attention Mechanism","4624813131563008":"Self-Attention Matrix Equations","5469206174498816":"Multihead Attention","6371961424576512":"Encoder-Decoder Attention","6332454513934336":"Transformers Pros and Cons","6165330994659328":"Unsupervised and Self-Supervised Pretraining","6314702933852160":"Introduction to Transformers in Computer Vision","5845856865222656":"Encoder-Decoder Design Pattern","5081521756831744":"Convolutional Encoders","4964586541023232":"Self-Attention vs. Convolution","6237284336402432":"Spatial vs. Channel vs. 
Temporal Attention","5577204024737792":"Local vs. Global Attention","5559101930864640":"Pros and Cons of Attention in CV","4740696550146048":"Image Classification with Vision Transformer (ViT and DeiT)","5878296753209344":"Shifter Window (Swin) Transformers","4618187708301312":"Object Detection Methods Review","6717336169742336":"DEtection TRansformers (DETR)","6251391324258304":"Image Segmentation Using ConvNets","5949525168226304":"Image Segmentation Using Transformers","6428178922602496":"Spatio-Temporal Transformers","5072275044564992":"Conclusion","6730443839504384":"Object Detection with Vision Transformers","6486290880790528":"Fine-Tuning Vision Transformers for Image Classification","4862119673724928":"Quiz: Transformers and Multihead Attention","4679595878121472":null,"5270519389749248":null,"5536863632883712":null,"6664445724065792":null,"4555722075537408":null,"4639051386847232":null,"5969434955087872":null,"6225641464791040":null,"4910942144036864":null,"5775199354093568":null,"4924467331596288":null,"4711170313420800":null,"4934947957768192":null,"5123991278845952":null,"5279278489010176":"Neural Machine Translation with a Transformer and Keras","5962224191537152":"Quiz: Attention in Computer Vision","6016449114144768":"Quiz: Attention and Inductive Bias","6205981189079040":"Quiz: Encoder-Decoder Architecture and Attention Mechanism in Computer Vision","5211288875696128":"Introduction to Transformers","5189798379782144":"Quiz: Transformers in Image Classification","5559032372920320":"Quiz: Transformers in Object Detection","5283825330552832":"Quiz: Transformers in Semantic Segmentation","5894575317188608":"Quiz: Spatio-Temporal Transformers","5604654588231680":null,"4700352631930880":null,"4515594840965120":"Task 0: Get Started","4730966160572416":"Task 0: Get Started","6337490154815488":null,"6222429331521536":null,"6725707710464000":"Task 2: Display the Last Context in the Dataset","5725280424558592":"Task 3: Display the Last Target in the 
Dataset","4517058149744640":null,"5403181633896448":null,"6453126327566336":null,"4790079338971136":"Task 1: Convert Batch of Strings to Token IDs (Context)","6421784961351680":"Task 2: Display Token IDs and Mask for Context","6328210743754752":"Task 1: Display Example Processed Data","6698517623078912":"Task 2: Encode Input Sequence","6202143462785024":"Task 1: Display Shapes of Sequences and Attention Results","6132807406583808":"Task 2: Display Attention Weights","6469473560297472":"Task 3: Display Attention Weights across Context Sequences","6682209598701568":"Task 4: Create a Decoder Instance","4935833625952256":"Task 1: Create a Translator Model Instance","5530308169564160":"Task 1: Compile the Translator Model","5202436556980224":"Task 2: Evaluate Model on Validation Data","6456045772865536":"Task 1: Translate the Example Sentence","5837637633048576":null,"5576941993328640":"Task 1: Import Libraries and Modules","5323548670427136":"Task 1: Load the Dataset","4564525012615168":"Task 2: Access the Label Information from the Training Split","6371475710935040":"Task 1: Extract Feature from the Pretrained Model","5891786500341760":"Task 2: Show the Configuration of Feature Extractor","5041305532104704":"Task 1: Preprocess a Training Example","4557594738950144":"Task 2: Visualizing Preprocessed Training Examples","5311633390960640":"Task 1: Import the Accuracy Metric","6481185407631360":null,"5688842962206720":"Task 1: Import Libraries and Module","6238369129562112":"Task 1: List and Sort Paths to Images and Annotations","6458747055636480":null,"5859014668320768":"Task 1: Visualize the First Image in the Training Set","5460638196432896":null,"5472425868394496":"Task 3: Visualize Patches from the First Image","4933704632369152":"Task 1: Calculate the Number of Patches","5727823214542848":"Task 2: Train ViT Object Detector and Store Training History","5488751174877184":"Task 3: Plot Training and Validation Loss","5061246102142976":"Task 1: Save ViT Object Detector 
Model","5727309110312960":null},"page_tags":{"4506676170063872":"","6159840717832192":"Transformers,Attention Mechanisms,Computer Vision","6296582315835392":"Inductive Bias,DNNs","6421025369358336":"Attention Mechanisms,Dynamic Weights,NLP Applications","5234721103282176":"Attention Mechanisms,Transformers,NMT","6514761212362752":"Attention Mechanisms,Sequential Processing,Transformer Models","4938705974067200":"Self Attention,Attention Maps","4624813131563008":"Self-attention,Attention mechanism","5469206174498816":"Multi-head attention,Interpretability","6371961424576512":"EncoderDecoder attention,Transformer architecture,SelfAttention mechanisms","6332454513934336":"Scalability,Parallelism,Pros&Cons","6165330994659328":"Pre-training,Fine-tuning","4862119673724928":"","4679595878121472":"","5270519389749248":"","5536863632883712":"","6664445724065792":"","4555722075537408":"","4639051386847232":"","5969434955087872":"","6225641464791040":"","4910942144036864":"","5775199354093568":"","4924467331596288":"","4711170313420800":"","4934947957768192":"","5123991278845952":"","5279278489010176":"","6314702933852160":"Computer Vision,Self-Attention","5962224191537152":"","5845856865222656":"Analysis & Synthesis,Computer Vision,Transformers","5081521756831744":"Convolution Encoders,Feature Extraction,Computer Vision","4964586541023232":"Self-Attention,Global Relationships,Feature Detection","6237284336402432":"Attention mechanisms,Feature maps","5559101930864640":"Attention Mechanisms,Convolution Operations","5577204024737792":"Local Attention,Criss-Cross Attention,Attention Mechanisms","6016449114144768":"","6205981189079040":"","5765478672891904":"","5211288875696128":"Transformers,NLP","4740696550146048":"Image Classification,DeiT,ViT","5189798379782144":"","5878296753209344":"SWIN,Image Classification","4618187708301312":"Object Detection,YOLO","5559032372920320":"","5283825330552832":"","5894575317188608":"","6717336169742336":"DETR,Object 
Detection","6251391324258304":"Image Segmentation","5949525168226304":"Semantic Segmentation,SETR,Segmenter","6428178922602496":"Spatio-Temporal,Video Transformer","5072275044564992":"","5604654588231680":"","4700352631930880":"","4837863031177216":"","6486290880790528":"","4515594840965120":"","6730443839504384":"","4730966160572416":"","6337490154815488":"","6222429331521536":"","6725707710464000":"","5725280424558592":"","4517058149744640":"","5403181633896448":"","6453126327566336":"","4790079338971136":"","6421784961351680":"","6328210743754752":"","6698517623078912":"","6202143462785024":"","6132807406583808":"","6469473560297472":"","6682209598701568":"","4935833625952256":"","5530308169564160":"","5202436556980224":"","6456045772865536":"","5837637633048576":"","5576941993328640":"","5323548670427136":"","4564525012615168":"","6371475710935040":"","5891786500341760":"","5041305532104704":"","4557594738950144":"","5311633390960640":"","6481185407631360":"","5688842962206720":"","6238369129562112":"","6458747055636480":"","5859014668320768":"","5460638196432896":"","5472425868394496":"","4933704632369152":"","5727823214542848":"","5488751174877184":"","5061246102142976":"","5727309110312960":"","5401168493805568":""},"collection_toc_is_enabled":true,"page_count":null,"docker":{"container":{"file":{"name":"cv_updated.tar.gz","size":948124},"imageName":"author-6586453712175104-collection-6479851841912832-rev-37-container-5732224581894144-cv_updated","buildStatus":"SUCCESS","buildStatusUrl":"/api/author/6586453712175104/collection/6479851841912832/containers/5732224581894144/build/status","buildLogUrl":"/api/author/6586453712175104/collection/6479851841912832/containers/5732224581894144/build/log","metadata":{"sizeInBytes":948124},"id":-1,"tarballDownloadUrl":"/api/author/6586453712175104/collection/6479851841912832/containers/5732224581894144/download","rebuildImageUrl":"/api/author/6586453712175104/collection/6479851841912832/containers/5732224581894144/rebuild
","track":false},"envs":[],"jobs":[{"key":"15fXdSn54ny0qtxd2Zp_n","jobType":"Live","name":"DETR","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/Detection_Transformer_with_DETR.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo \"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true},{"key":"__jP-Iir979Ae3Ye88lXY","jobType":"Live","name":"Vit","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/Image_Classification_with_ViT.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo \"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true},{"key":"g9kLLWBHcmJGeEvOV3C4p","jobType":"Live","name":"ODM","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/Object_Detection_with_Yolo.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo \"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true},{"key":"0TXgwFe--d7NaQ1o2H5X6","jobType":"Live","name":"CNN","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/Image_Segmentation_using_CNN.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo \"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true},{"key":"x94LUWHo5w9Xsh5NkHCSj","jobType":"Live","name":"Transformers","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/ImageSegmentationTransformers.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo \"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true},{"key":"zcuoG9sqcmhVGOB2jsXjR","jobType":"Live","name":"DETR2","inputFileName":"foo","runScript":"nohup jupyter notebook /usr/local/notebooks/Detection_Transformer_with_DETR.ipynb --allow-root --no-browser > /dev/null 2>&1 &","ports":"8080","startScript":"echo 
\"hello\"","runInLiveContainer":true,"forceRelaunchOnRun":true}],"testRunners":[],"version":3,"loaded":true},"discounted_price":24,"cover_image_id":5004325450022912,"cover_image_metadata":"{\"width\":1024,\"height\":512,\"sizeInBytes\":29499,\"name\":\"Transformers in Computer Vision (1).png\"}","cover_image_serving_url":"/v2api/collection/6586453712175104/6479851841912832/image/5004325450022912","tags":[],"intro_video_url":"","intro_video_thumbnail_url":"","aggregated_widget_stats":{"projects":3,"assessments":0,"cloudlabs":0,"SlateHTML":169,"codeExerciseCount":0,"codeRunnableCount":24,"codeSnippetCount":5,"illustrations":95,"Quiz":8,"Image":0,"CanvasAnimation":0,"DrawIOWidget":94,"Table":7,"Code":24,"MarkMap":1,"Latex":2,"LiveApp":5},"default_themes":{"code_themes":{"Code":"default","Markdown":"default","RunJS":"default","SPA":"default","isForced":{"Code":false,"Markdown":false,"RunJS":false,"SPA":false}}},"api_keys":{"api_keys":[]},"skills":[],"testimonials":[],"licensing":null,"target_audience":"advanced","author_id":"6586453712175104","collection_id":"6479851841912832","approval_status":3005,"price":24,"is_private":false,"path_type":"regular","organization_id":null,"is_mini":false,"is_priced":true,"brief_summary":"Learn about transformer networks, self-attention, multi-head attention, and spatiotemporal transformers in this course, focusing on their applications in computer vision and deep 
learning.","approval_update_time":"2024-08-13T17:15:48.007Z","rating_visibility":true,"update_last_published_on_homepage":true,"show_developed_by":true,"udata_files":[],"CodeThemes":{"Code":"default","Markdown":"default","RunJS":"default","SPA":"default","isForced":{"Code":false,"Markdown":false,"RunJS":false,"SPA":false}},"is_marked_for_deletion":false,"transition_page_title":"","is_redirectable":false,"collection_type":"collection","adaptive_learning_mode":false,"HLOs_to_toc":{},"is_guide":false,"read_time":18000,"allow_logged_out_executions":false,"unique_live_widget_urls":false,"metadata_status":101},"pageSummarySSR":{"title":"Quiz: Spatio-Temporal Transformers","description":"Test your understanding of transformer applications in video analysis.","discourse_page_url":"https://discuss.educative.io/tag/quiz-spatio-temporal-transformers__spatio-temporal-transformers__transformers-for-computer-vision-applications?open=true&ctag=transformers-for-computer-vision-applications__ammar-mohanna&cslug=vision-transformers&pslug=quiz-spatio-temporal-transformers"},"adaptiveLearningConfigConstantSSR":0,"enableLessonPageLockedBannerV2":true,"allowAllLessonPreview":false,"lockedBannerStatsSSR":{"b2cTrialStats":{"is_b2c_trial_active":true,"b2c_trial_active_duration":7,"b2c_trial_categories":"$a5"},"b2cStatus":100,"learnerTags":"$a6","workStats":1430,"interviewWorksStats":76,"inL2cStarterPack":false,"l2cWorkStats":38,"enableL2cStarterPackPaymentWidget":"true"},"pageTocSSR":"","authorId":"6586453712175104","collectionId":"6479851841912832","pageId":"5894575317188608","isCollectionPageLockedCachingEnabled":true,"aceFeatureFlags":{"enableAceEditor":true,"enableAceEditorForAnswers":true},"meta":{"type":["Article","TechArticle"],"title":"Quiz: Spatio-Temporal Transformers","name":"Transformers for Computer Vision Applications","description":"Test your understanding of transformer applications in video 
analysis.","image":"https://educative.io/api/collection/6586453712175104/6479851841912832/image/5004325450022912.png","isAccessibleForFree":false,"keywords":"$a6","provider":"Educative","publisher":"Educative","id":"courses/vision-transformers/quiz-spatio-temporal-transformers","author":"Educative","educationalLevel":"advanced","noIndex":false,"isForcedNoIndex":false,"noFollow":false,"redirectInfo":{"isDeletedCollectionPageRedirectable":false},"page_titles":"$a7","is_marked_for_deletion":false,"transition_page_title":"","is_redirectable":false,"deleted_course_lesson_redirect":{"author_id":null,"collection_id":null,"page_id":null,"redirect_url_slug":null},"metadata_status":101,"additional_course_alternatives":[]},"requestUrl":"/courses/vision-transformers/quiz-spatio-temporal-transformers","requestUrlInfo":{"authorId":"6586453712175104","collectionId":"6479851841912832","pageId":"5894575317188608","courseUrlSlug":"vision-transformers","pageUrlSlug":"quiz-spatio-temporal-transformers"},"isExternalContent":false}}],[["$","script",null,{"id":"generate-data","type":"application/ld+json","dangerouslySetInnerHTML":{"__html":"$a8"}}],false,"$undefined"]]