Transformers Applied to Computer Vision

Learn how transformers are applied to computer vision.


This course is about NLP, not computer vision. However, in the previous lessons of this chapter, we worked with general-purpose sequence models that can be applied to many domains. Computer vision is one of them.

The title of the article by Dosovitskiy et al. (2021) says it all: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” The authors processed an image as a sequence of patches, much as a sentence is processed as a sequence of words. The results proved their point.
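To make the idea of treating an image as a sequence concrete, here is a minimal sketch in plain Python. It is not the paper's code or the notebook's code: the function name `image_to_patches`, the toy 4x4 image, and the 2x2 patch size are all assumptions chosen for illustration. The paper uses 16x16 patches on real images and then projects each flattened patch into an embedding, a step this sketch leaves out.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (a list of pixel rows) into a row-major
    sequence of flattened square patches.

    Hypothetical helper for illustration only.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    # Walk over the image one patch at a time, left to right, top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch into a single list of pixel values.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 grayscale "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = image_to_patches(image, patch_size=2)
print(len(patches))  # 4 patches, each playing the role of a "word"
print(patches[0])    # [0, 1, 4, 5]
```

Each flattened patch plays the role a token plays in a sentence, so the resulting list can be fed to a transformer like any other sequence. In practice, this step would normally be done with a tensor library rather than nested lists.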

Google has made vision transformers available in a Jupyter notebook.

The Jupyter notebook Compact_Convolutional_Transformers.ipynb (under the “Code playground” section) is self-explanatory. You can explore it to see how it works. However, bear in mind that when Industry 4.0 reaches maturity and Industry 5.0 kicks in, the best implementations will be obtained by integrating our data on Cloud AI platforms. Local development will diminish, and companies will turn to Cloud AI rather than bear the cost of local development, maintenance, and support.

Some code contents

The notebook’s table of contents contains a transformer process we have gone through several times in this course. However, this time, it’s simply applied to sequences of digital image information.

The notebook follows standard deep learning methods. Among other things, it displays some images with their labels.
