Metadata Treatment
Explore how to handle PDF metadata by extracting, updating, and deleting document information using the Python libraries PyPDF4 and Pikepdf. This lesson offers practical code examples and scenarios to help you manipulate both the document information dictionary (DID) and XMP metadata within PDF files.
Introduction
Metadata is typically populated by PDF conversion applications. It includes common fields such as the document version, the creation date, and the program that created the file. Some frequently overlooked attributes merit a closer look if you want to dive into PDF analysis.
Scope
The objective of this lesson is to show how to extract, update, and delete the metadata of a PDF file using the Python programming language.
Prerequisites
We need two libraries for metadata manipulation:
PyPDF4
It is a pure-Python PDF library best suited to splitting, merging, cropping, and transforming the pages of a PDF file. It can also retrieve text and metadata from PDFs.
Pikepdf
It is a library intended for developers who need to create, manipulate, and parse PDF files. It supports reading and writing PDFs, including creating them from scratch.
| Library | Version |
|---|---|
| PyPDF4 | 1.27.0 |
| Pikepdf | 3.0.0 |
Unlike PyPDF4, the Pikepdf library supports editing a PDF's XMP metadata, so we will rely on it for those operations in this lesson.
Let’s start coding
By harnessing the capabilities of the PyPDF4 library, we will define the functions collect_did_metadata, update_did_metadata, and collect_xmp_metadata.
Next, we will rely on the Pikepdf library to develop the functions modify_metadata and delete_metadata.
Afterward, we will utilize these functions in different scenarios to manipulate the metadata of sample PDF files.
Let’s see what that looks like in code:
We defined the function collect_did_metadata to extract the document information dictionary (DID) from a pre-selected PDF file (Line 8). This function relies on the predefined list of DID attributes (Lines 10-14).
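The idea can be sketched as follows. This is a minimal illustration rather than the lesson's original listing, so its line numbers do not match those cited above. The helper pick_did_attributes, the constant DID_ATTRIBUTES, and the parameter name input_file are assumptions made for the example. The attribute list holds the standard DID keys, and PyPDF4 is imported inside the function so that the helper can be used on its own.

```python
from typing import Any, Mapping, Optional

# Standard keys of the document information dictionary (DID).
DID_ATTRIBUTES = (
    "/Title", "/Author", "/Subject", "/Keywords", "/Creator",
    "/Producer", "/CreationDate", "/ModDate", "/Trapped",
)


def pick_did_attributes(info: Optional[Mapping[str, Any]]) -> dict:
    """Return every standard DID attribute, using None for missing ones."""
    info = info or {}
    return {
        key: (str(info[key]) if key in info else None)
        for key in DID_ATTRIBUTES
    }


def collect_did_metadata(input_file: str) -> dict:
    """Read the DID of input_file with PyPDF4 and return it as a plain dict."""
    # Imported here so that pick_did_attributes has no third-party dependency.
    from PyPDF4 import PdfFileReader

    with open(input_file, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file, strict=False)
        # getDocumentInfo() returns None when the file has no DID.
        return pick_did_attributes(reader.getDocumentInfo())
```

Returning plain strings keeps the result independent of the open file handle and easy to print or compare.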
Using the PyPDF4 library, we defined the function update_did_metadata. This function updates the DID metadata attributes based on the variable metadata_dict specified as a parameter. Let us discuss it in further detail:
- We initialized a PdfFileMerger object (Line 7).
- We added custom metadata to this object (Lines 10-19).
- We saved this object to an output file (Line 21).
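The three steps above can be sketched as shown below. This is an illustrative version under stated assumptions, not the original listing: the helper to_did_keys and the parameter names input_file and output_file are hypothetical, and values are converted to strings because PyPDF4 stores DID entries as string objects.

```python
from typing import Mapping


def to_did_keys(metadata_dict: Mapping[str, object]) -> dict:
    """Prefix each attribute name with '/' (as DID keys require) and stringify values."""
    return {
        (name if name.startswith("/") else "/" + name): str(value)
        for name, value in metadata_dict.items()
    }


def update_did_metadata(input_file: str, output_file: str,
                        metadata_dict: Mapping[str, object]) -> None:
    """Copy input_file to output_file with the given DID attributes set."""
    # Imported here so that to_did_keys has no third-party dependency.
    from PyPDF4 import PdfFileMerger

    merger = PdfFileMerger()                      # step 1: create the merger
    with open(input_file, "rb") as source:
        merger.append(source)                     # copy all pages of the source
        merger.addMetadata(to_did_keys(metadata_dict))  # step 2: add metadata
        with open(output_file, "wb") as target:
            merger.write(target)                  # step 3: save the result
    merger.close()
```

The merger builds a new document, so it is worth checking whether attributes that are not listed in metadata_dict survive in the output. If they need to be kept, one option is to collect them first and pass them along with the new values.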
We defined the function collect_xmp_metadata, intended to extract the XMP metadata (Line 8) based on the exhaustive collection of Extensible Metadata Platform attributes already drawn up (Lines 10-14).
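A possible shape for this function is sketched below. It is not the original listing. The names XMP_ATTRIBUTES and read_xmp_attributes are assumptions made for the example, and the attribute names follow the properties that the XMP object returned by PyPDF4's getXmpMetadata is understood to expose.

```python
from typing import Any

# Property names assumed to be exposed by PyPDF4's XMP information object.
XMP_ATTRIBUTES = (
    "dc_contributor", "dc_coverage", "dc_creator", "dc_date", "dc_description",
    "dc_format", "dc_identifier", "dc_language", "dc_publisher", "dc_relation",
    "dc_rights", "dc_source", "dc_subject", "dc_title", "dc_type",
    "pdf_keywords", "pdf_pdfversion", "pdf_producer",
    "xmp_createDate", "xmp_modifyDate", "xmp_metadataDate", "xmp_creatorTool",
    "xmpmm_documentId", "xmpmm_instanceId",
)


def read_xmp_attributes(xmp: Any) -> dict:
    """Read each known XMP attribute from xmp; return {} when there is no XMP."""
    if xmp is None:
        return {}
    return {name: getattr(xmp, name, None) for name in XMP_ATTRIBUTES}


def collect_xmp_metadata(input_file: str) -> dict:
    """Extract the XMP metadata of input_file with PyPDF4."""
    # Imported here so that read_xmp_attributes has no third-party dependency.
    from PyPDF4 import PdfFileReader

    with open(input_file, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file, strict=False)
        # getXmpMetadata() returns None when the document has no XMP packet.
        return read_xmp_attributes(reader.getXmpMetadata())
```

Many PDFs carry no XMP packet at all, which is why the empty dictionary is treated as a normal result rather than an error.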
By leveraging the Python library Pikepdf, we added two functions:
- modify_metadata to handle the modification of the XMP metadata attributes. This function accepts a parameter called metadata_dict holding the name of each attribute to modify and its new value. It opens a PDF document (Line 7), loops through the dictionary of its metadata attributes (Lines 8-9), and replaces the values of the attributes to modify (Line 10).
- delete_metadata to handle the deletion of the XMP metadata attributes. This function accepts a parameter named metadata_list, containing the names of the attributes to remove. It works similarly to the previous function, but deletes the attribute instead of replacing its value (Line 23).
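The two functions can be sketched as follows. This is an illustration under assumptions rather than the original listing: the helpers apply_updates and apply_deletions and the parameter names input_file and output_file are hypothetical. The Pikepdf calls used are pikepdf.open, Pdf.open_metadata, and Pdf.save, and attribute names are expected in namespace-prefixed form such as dc:title or pdf:Keywords.

```python
from typing import Iterable, Mapping, MutableMapping


def apply_updates(meta: MutableMapping, metadata_dict: Mapping[str, object]) -> None:
    """Set each attribute in metadata_dict on the metadata mapping."""
    for name, value in metadata_dict.items():
        meta[name] = value


def apply_deletions(meta: MutableMapping, metadata_list: Iterable[str]) -> list:
    """Delete the listed attributes if present; return the names actually removed."""
    removed = []
    for name in metadata_list:
        if name in meta:
            del meta[name]
            removed.append(name)
    return removed


def modify_metadata(input_file: str, output_file: str,
                    metadata_dict: Mapping[str, object]) -> None:
    """Update XMP attributes of input_file and save the result to output_file."""
    import pikepdf  # imported here so the helpers above have no dependency

    with pikepdf.open(input_file) as pdf:
        # Changes are written back to the document when the block exits.
        with pdf.open_metadata() as meta:
            apply_updates(meta, metadata_dict)
        pdf.save(output_file)


def delete_metadata(input_file: str, output_file: str,
                    metadata_list: Iterable[str]) -> list:
    """Remove XMP attributes from input_file and save the result to output_file."""
    import pikepdf

    with pikepdf.open(input_file) as pdf:
        with pdf.open_metadata() as meta:
            removed = apply_deletions(meta, metadata_list)
        pdf.save(output_file)
    return removed
```

Writing to a separate output file avoids overwriting the input while it is still open. Skipping attributes that are absent lets delete_metadata be called with a broad list without raising an error.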
Let’s try our utility
Here we will address common test scenarios:
Scenario 1: Collecting the DID metadata attributes
This scenario describes how to collect the DID metadata attributes.
We will extract the DID metadata attributes of a sample PDF file called Predict_Emotions_v1.pdf, then look into these attributes in Adobe Acrobat Reader and compare their values with the gathered ones.
Execute the following code snippet and check the collected DID metadata attributes:
The following figure, extracted using Adobe Acrobat Reader, exhibits the collected DID metadata attributes:
Scenario 2: Updating the DID metadata attributes
This scenario shows how to update the DID metadata attributes.
We will modify the DID metadata attributes of a sample PDF file called Predict_Emotions_v1.pdf and we will save the updated instance of this PDF to a new file called Predict_Emotions_v1_meta.pdf.
Next, we will collect the DID metadata attributes from the resulting PDF document and compare their values to those extracted using Adobe Acrobat Reader.
Execute the following code snippet, and visualize the results:
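When preparing the attributes for such an update, date fields like /ModDate need to follow the PDF date format, D:YYYYMMDDHHmmSS followed by an optional time-zone suffix. The sketch below shows one way to build them. The helper pdf_date and all values in metadata_dict are hypothetical and only illustrate the shape of the input.

```python
from datetime import datetime, timedelta, timezone


def pdf_date(moment: datetime) -> str:
    """Format a datetime as a PDF date string such as D:20220115093000+01'00'."""
    stamp = moment.strftime("D:%Y%m%d%H%M%S")
    offset = moment.utcoffset()
    if offset is None:
        return stamp                      # naive datetime: no time-zone suffix
    minutes = int(offset.total_seconds() // 60)
    if minutes == 0:
        return stamp + "Z"                # UTC
    sign = "+" if minutes > 0 else "-"
    hours, mins = divmod(abs(minutes), 60)
    return f"{stamp}{sign}{hours:02d}'{mins:02d}'"


# Hypothetical attribute values for the update scenario.
metadata_dict = {
    "/Title": "Predict Emotions",
    "/Author": "Jane Doe",
    "/ModDate": pdf_date(
        datetime(2022, 1, 15, 9, 30, 0, tzinfo=timezone(timedelta(hours=1)))
    ),
}
```

A dictionary of this shape can then be passed to update_did_metadata together with the input and output file names.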
The following figure, extracted using Adobe Acrobat Reader, exhibits the updated DID attributes.
Please refer to the modified items which are highlighted in red boxes.
Scenario 3: Managing the XMP metadata attributes
This scenario outlines multiple use cases related to the Extensible Metadata Platform (XMP) attributes and applies them to a sample file called PDF2.pdf.
Execute the following code snippet covering the following:
- Collecting the XMP attributes.
- Updating some XMP attributes.
- Deleting some XMP attributes.
The next figures, extracted using Adobe Acrobat Reader, exhibit the use cases outlined previously.
Please refer to the affected items highlighted in red boxes for additional details.
Now that you’re armed with all this information about metadata manipulation, try changing these code snippets and developing your own custom scenarios!
Conclusion
We have walked through the routines needed to gather and manipulate the PDF metadata, while leveraging the capabilities of the PyPDF4 and Pikepdf Python libraries.