Introduction

A PDF's metadata is typically populated by the application that created or converted the file. It includes common fields such as the document version, the creation date, and the program that created the file, among others. Some often-overlooked attributes merit a closer look if you want to dive into PDF analysis.

Scope

The objective of this lesson is to show how to extract, update, and delete the metadata of a PDF file using the Python programming language.

Prerequisites

We need two libraries for metadata manipulation:

PyPDF4

It is a pure-Python PDF library best suited to splitting, merging, cropping, and transforming the pages of a PDF file. It can also retrieve text and metadata from PDFs.

Pikepdf

It is a library for developers who need to create, manipulate, and parse PDF files. It supports reading and writing PDFs, including creating them from scratch.

Library     Version
PyPDF4      1.27.0
Pikepdf     3.0.0

Unlike PyPDF4, Pikepdf can edit a PDF's XMP metadata, so we will rely on it for that part of the lesson.

Let’s start coding

Using the PyPDF4 library, we will define the functions collect_did_metadata, update_did_metadata, and collect_xmp_metadata.
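As a rough illustration of what these three functions might look like, here is a minimal sketch. It assumes that "did" refers to the PDF document information dictionary and that PyPDF4 1.27.0 exposes the PyPDF2-style API (PdfFileReader, PdfFileWriter, getDocumentInfo, getXmpMetadata, addMetadata). The function bodies, their parameters, and the clean_info helper are hypothetical and are not taken from the lesson's own code.

```python
def clean_info(info):
    """Hypothetical helper: turn {'/Title': ...} into a plain {str: str} dict."""
    if not info:
        return {}
    return {str(key).lstrip("/"): str(value) for key, value in info.items()}


def collect_did_metadata(pdf_path):
    """Return the document information dictionary of the PDF at pdf_path."""
    # Imported lazily so clean_info stays usable without PyPDF4 installed.
    from PyPDF4 import PdfFileReader

    with open(pdf_path, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        return clean_info(reader.getDocumentInfo())


def update_did_metadata(src_path, dst_path, updates):
    """Copy src_path to dst_path, merging updates (e.g. {'Title': 'x'}) into the info dictionary."""
    from PyPDF4 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as stream:
        reader = PdfFileReader(stream, strict=False)
        writer = PdfFileWriter()
        writer.appendPagesFromReader(reader)

        merged = clean_info(reader.getDocumentInfo())
        merged.update(updates)
        # addMetadata expects PDF name keys that start with a slash.
        writer.addMetadata({"/" + key: str(value) for key, value in merged.items()})

        # Write while the source stream is still open: pages are read lazily.
        with open(dst_path, "wb") as out:
            writer.write(out)


def collect_xmp_metadata(pdf_path):
    """Return a few XMP fields, or an empty dict if the file has no XMP packet."""
    from PyPDF4 import PdfFileReader

    with open(pdf_path, "rb") as stream:
        xmp = PdfFileReader(stream, strict=False).getXmpMetadata()
        if xmp is None:
            return {}
        return {
            "title": xmp.dc_title,      # mapping of language code to text
            "creator": xmp.dc_creator,  # list of names
            "producer": xmp.pdf_producer,
        }
```

Under these assumptions, a call such as update_did_metadata("in.pdf", "out.pdf", {"Title": "Quarterly report"}) would write a copy of the file with a new title and leave the original untouched; the file names here are placeholders.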

Next, we will rely on the Pikepdf library to develop the functions modify_metadata and delete_metadata.
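A minimal sketch of these two functions could look like the following. It assumes pikepdf's open_metadata context manager for XMP editing and removes metadata by deleting the /Metadata entry of the document catalog and the /Info entry of the trailer. The signatures, the FRIENDLY_TO_XMP table, and the to_xmp helper are hypothetical names introduced only for this illustration.

```python
# Hypothetical mapping from friendly names to XMP property names.
FRIENDLY_TO_XMP = {
    "title": "dc:title",
    "author": "dc:creator",
    "subject": "dc:description",
    "keywords": "pdf:Keywords",
}


def to_xmp(updates):
    """Hypothetical helper: translate friendly names to XMP keys."""
    xmp = {}
    for name, value in updates.items():
        key = FRIENDLY_TO_XMP.get(name, name)  # unknown names pass through unchanged
        if key == "dc:creator" and isinstance(value, str):
            value = [value]  # dc:creator holds a list of names
        xmp[key] = value
    return xmp


def modify_metadata(src_path, dst_path, updates):
    """Write a copy of src_path whose XMP metadata includes the given updates."""
    import pikepdf  # imported lazily so to_xmp works without pikepdf installed

    with pikepdf.open(src_path) as pdf:
        # On exit, open_metadata writes the XMP packet back into the PDF.
        with pdf.open_metadata() as meta:
            for key, value in to_xmp(updates).items():
                meta[key] = value
        pdf.save(dst_path)


def delete_metadata(src_path, dst_path):
    """Write a copy of src_path without its XMP packet and info dictionary."""
    import pikepdf

    with pikepdf.open(src_path) as pdf:
        if "/Metadata" in pdf.Root:
            del pdf.Root["/Metadata"]  # XMP metadata stream
        if "/Info" in pdf.trailer:
            del pdf.trailer["/Info"]   # document information dictionary
        pdf.save(dst_path)
```

Both functions save to a separate destination path rather than overwriting the input, which keeps the original file available for comparison.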

Afterward, we will utilize these functions in different scenarios to manipulate the metadata of sample PDF files.

Let’s see what that looks like in code:
