How to Extract Images from PDF

Learn how to extract images from a PDF document using the PyMuPDF and Pillow libraries.

Introduction

The PDF file format can hold several types of content, including text, images, and other multimedia elements.

Parsing a PDF document and extracting its images is not a straightforward task, but Python libraries make it manageable.

How images are stored in a PDF file

Generally, an image is stored in a PDF file as a separate object called an image XObject. This object is a stream that holds the image's raw binary data (its pixels) together with a dictionary describing it: its dimensions, color space, and other related information.
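To make that structure concrete, here is a minimal sketch that reads a few keys from a hand-written image XObject. The object number, dimensions, filter, and the helper name describe_image_object are made up for illustration, and the regular expressions only cope with this simple layout. Real files often use indirect references or arrays for values such as the color space, so this is not a substitute for a PDF library.

```python
import re

# Hand-written fragment imitating an image XObject as it could appear in a PDF.
# All values below are illustrative assumptions, not taken from a real file.
SAMPLE_OBJECT = b"""12 0 obj
<< /Type /XObject /Subtype /Image /Width 640 /Height 480
   /ColorSpace /DeviceRGB /BitsPerComponent 8
   /Filter /DCTDecode /Length 21 >>
stream
...jpeg bytes here...
endstream
endobj
"""


def describe_image_object(raw: bytes) -> dict:
    """Pull a few well-known keys out of a simple image XObject dictionary."""
    # Everything before the first "stream" keyword is the object's dictionary.
    header = raw.split(b"stream", 1)[0].decode("latin-1")
    if not re.search(r"/Subtype\s*/Image\b", header):
        raise ValueError("not an image XObject")

    def number(key: str) -> int:
        return int(re.search(rf"/{key}\s+(\d+)", header).group(1))

    def name(key: str) -> str:
        return re.search(rf"/{key}\s*/(\w+)", header).group(1)

    return {
        "width": number("Width"),
        "height": number("Height"),
        "bits_per_component": number("BitsPerComponent"),
        "colorspace": name("ColorSpace"),
        "filter": name("Filter"),  # DCTDecode means the stream is JPEG data
    }


print(describe_image_object(SAMPLE_OBJECT))
```

The dictionary carries the metadata, while the bytes between stream and endstream carry the encoded pixels.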

It is worth mentioning that the way images are stored may vary depending on the tool that created the PDF.

The following figure shows the image objects included within the cross-reference table of a sample PDF file:
