Metadata Treatment

Explore how to handle PDF metadata by extracting, updating, and deleting document information using the Python libraries PyPDF4 and Pikepdf. This lesson offers practical code examples and scenarios to help you manipulate both Document Information Dictionary (DID) and Extensible Metadata Platform (XMP) metadata within PDF files.

Introduction

Metadata is typically populated by the application that creates or converts the PDF. It includes common fields such as the document version, the creation date, and the creating program, among others. Some frequently overlooked attributes merit a closer look if you want to dive into PDF analysis.

Scope

The objective of this lesson is to show how to extract, update, and delete the metadata of a PDF file using the Python programming language.

Prerequisites

We need two libraries for metadata manipulation:

PyPDF4

It is a pure-Python PDF library best suited to splitting, merging, cropping, and transforming the pages of a PDF file. Additionally, it can retrieve text and metadata from PDFs.

Pikepdf

It is a library intended for developers to create, manipulate, and parse the PDF format. It supports reading and writing PDFs, including creating from scratch.

Library    Version
PyPDF4     1.27.0
Pikepdf    3.0.0

Unlike the PyPDF4 library, the Pikepdf library allows editing of a PDF's XMP metadata. Therefore, we will rely on it for the XMP modification and deletion steps in this lesson.
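
Before running the examples, it can help to confirm which versions are actually installed. The following is a minimal sketch using the standard library's importlib.metadata (available from Python 3.8). The helper name installed_version is ours and not part of either library, and the distribution names PyPDF4 and pikepdf are assumed to match the names the packages were installed under.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Report the versions found in the current environment
for name in ("PyPDF4", "pikepdf"):
    print(name, installed_version(name))
```

If the reported versions differ from the table above, the function names used in this lesson may behave differently or be unavailable.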

Let’s start coding

By harnessing the capabilities of the PyPDF4 library, we will define the functions collect_did_metadata, update_did_metadata and collect_xmp_metadata.

Next, we will rely on the Pikepdf library to develop the functions modify_metadata and delete_metadata.

Afterward, we will utilize these functions in different scenarios to manipulate the metadata of sample PDF files.

Let’s see what that looks like in code:

We defined the function collect_did_metadata to extract the Document Information Dictionary (DID) from a given PDF file (Line 8). The function then looks up each attribute in the predefined list of DID attributes (Lines 10-14).

Python 3.5
def collect_did_metadata(input_file: str):
    """
    Collect Document Information Dictionary metadata
    """
    # Initialize a PdfFileReader object
    pdf_reader = PdfFileReader(input_file)
    # Create an object containing the Document Information metadata
    did_metadata = pdf_reader.getDocumentInfo()
    did = {}
    for i in DID_ATTRIBUTES:
        try:
            did[i] = did_metadata.get(i)
        except:
            did[i] = ''
    return did
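
One detail of the loop above is easy to miss: a mapping's get method returns None for an absent key rather than raising, so the except branch is mainly reached when no information object is available at all (we assume the reader may return None in that case). The sketch below reproduces the loop with a plain dict standing in for the object returned by getDocumentInfo. The function name normalize_did and the choice to turn every missing value into an empty string are assumptions made for illustration, not part of PyPDF4.

```python
from typing import Mapping, Optional, Sequence


def normalize_did(raw: Optional[Mapping[str, object]],
                  attributes: Sequence[str]) -> dict:
    """Return every requested attribute, using '' for anything missing."""
    raw = raw or {}  # treat "no information dictionary" as an empty mapping
    result = {}
    for name in attributes:
        value = raw.get(name)
        result[name] = '' if value is None else str(value)
    return result


# A plain dict stands in for the PyPDF4 document information object
sample = {'/Title': 'Report', '/Author': None}
print(normalize_did(sample, ['/Title', '/Author', '/Trapped']))
```

With this variation, callers always receive strings and never have to distinguish None from an empty value.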

Using the PyPDF4 library, we defined the function update_did_metadata. This function updates the DID metadata attributes based on the variable metadata_dict specified as a parameter. Let us discuss it in further detail:

  1. We initialized a PdfFileMerger object (Line 7).
  2. We added custom metadata to this object (Lines 10-19).
  3. We saved this object to an output file (Line 21).
Python 3.5
def update_did_metadata(input_file: str,
                        output_file: str,
                        metadata_dict: dict):
    """
    Update the Document Information Dictionary metadata
    """
    pdf_merger = PdfFileMerger()
    pdf_merger.append(fileobj=open(input_file, 'rb'))
    pdf_merger.addMetadata({
        '/Author': metadata_dict.get('Author'),
        '/Subject': metadata_dict.get('Subject'),
        '/Title': metadata_dict.get('Title'),
        '/Keywords': metadata_dict.get('Keywords'),
        '/Producer': metadata_dict.get('Producer'),
        '/Creator': metadata_dict.get('Creator'),
        '/CreationDate': metadata_dict.get('CreationDate'),  # PDF date string, e.g. D:20210901120400Z
        '/ModDate': metadata_dict.get('ModDate'),  # PDF date string, e.g. D:20210901120400Z
    })
    pdf_out = open(output_file, 'wb')
    pdf_merger.write(pdf_out)
    pdf_out.close()
    pdf_merger.close()
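
Because update_did_metadata calls metadata_dict.get for all eight keys, a caller who supplies only some of them hands None values to addMetadata. We have not verified how PyPDF4 1.27.0 reacts to None values, so the sketch below simply avoids the situation by filtering first. build_did_payload and DID_KEYS are hypothetical names introduced here for illustration.

```python
# Keys accepted by update_did_metadata, without the leading slash
DID_KEYS = ('Author', 'Subject', 'Title', 'Keywords',
            'Producer', 'Creator', 'CreationDate', 'ModDate')


def build_did_payload(metadata_dict: dict) -> dict:
    """Prefix known keys with '/' and drop entries that are missing or None."""
    return {
        '/' + key: str(metadata_dict[key])
        for key in DID_KEYS
        if metadata_dict.get(key) is not None
    }


# Only the supplied, non-empty attribute survives
print(build_did_payload({'Author': 'Author Educative', 'Title': None}))
```

The resulting dictionary could then be passed to addMetadata in place of the literal dictionary, so that only the attributes the caller actually provided are written.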

We defined the function collect_xmp_metadata to extract the XMP metadata (Line 8). The function then reads each attribute in the predefined list of Extensible Metadata Platform attributes (Lines 10-14).

Python 3.5
def collect_xmp_metadata(input_file: str):
    """
    Collect Extensible Metadata Platform metadata
    """
    # Initialize a PdfFileReader object
    pdf_reader = PdfFileReader(input_file)
    # Create an object containing the extractable Extensible Metadata Platform metadata
    xmp_metadata = pdf_reader.getXmpMetadata()
    xmp = {}
    for i in xmp_attributes:
        try:
            xmp[i] = getattr(xmp_metadata, i)
        except:
            xmp[i] = ''
    return xmp

By leveraging the Python library Pikepdf, we added two functions:

  • modify_metadata to handle the modification of the XMP metadata attributes. This function accepts a parameter called metadata_dict holding the names of the attributes to modify and their new values. It opens a PDF document (Line 7), opens its metadata and loops through metadata_dict (Lines 8-9), and replaces the value of each listed attribute (Line 10).

  • delete_metadata to handle the deletion of the XMP metadata attributes. This function accepts a parameter named metadata_list, containing the names of the attributes to remove. It works similarly to the previous function, but deletes the attribute instead of replacing its value (Line 23).

Python 3.5
def modify_metadata(input_file: str,
                    output_file: str,
                    metadata_dict: dict):
    """
    Change the metadata
    """
    with pikepdf.open(input_file) as pdf_in:
        with pdf_in.open_metadata(set_pikepdf_as_editor=False, update_docinfo=True) as meta:
            for key, val in metadata_dict.items():
                meta[key] = val
        pdf_in.save(output_file)


def delete_metadata(input_file: str,
                    output_file: str,
                    metadata_list: list):
    """
    Delete the list of metadata elements
    """
    with pikepdf.open(input_file) as pdf_in:
        with pdf_in.open_metadata(set_pikepdf_as_editor=False, update_docinfo=True) as meta:
            for i in metadata_list:
                del meta[i]
        pdf_in.save(output_file)
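
delete_metadata applies del to every name in metadata_list. For an ordinary Python mapping, del raises KeyError when the key is absent. Assuming the Pikepdf metadata object behaves like a mapping in this respect (we have not confirmed this for version 3.0.0), one cautious option is to filter the list before deleting. The helper name existing_keys is hypothetical, and a plain dict stands in for the metadata object so the sketch runs without a PDF.

```python
from typing import Iterable, List


def existing_keys(meta, requested: Iterable[str]) -> List[str]:
    """Keep only the requested keys that are present in the mapping."""
    return [key for key in requested if key in meta]


# A plain dict stands in for the metadata object here
fake_meta = {'dc:title': 'Educative Title', 'pdf:Producer': 'Educative Producer'}
for key in existing_keys(fake_meta, ['dc:title', 'xmp:CreatorTool']):
    del fake_meta[key]
print(fake_meta)
```

Inside delete_metadata, the same filter could be applied to metadata_list before the loop, so that a missing attribute is skipped instead of interrupting the run.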

Let’s try our utility

Here we will address common test scenarios:

Scenario 1: Collecting the DID metadata attributes

This scenario describes how to collect the DID metadata attributes.

We will extract the DID metadata attributes of a sample PDF file called Predict_Emotions_v1.pdf, then look into these attributes in Adobe Acrobat Reader and compare their values with the gathered ones.

Execute the following code snippet and check the collected DID metadata attributes:

Python 3.5
# Import libraries
from PyPDF4 import PdfFileReader, PdfFileMerger
import os, subprocess

# Exhaustive collection of Document Information Dictionary attributes
DID_ATTRIBUTES = ['/Title', '/Author', '/Subject', '/Keywords', '/Creator', '/Producer',
                  '/CreationDate', '/ModDate', '/Trapped']


def collect_did_metadata(input_file: str):
    """
    Collect Document Information Dictionary metadata
    """
    # Initialize a PdfFileReader object
    pdf_reader = PdfFileReader(input_file)
    # Create an object containing the Document Information metadata
    did_metadata = pdf_reader.getDocumentInfo()
    did = {}
    for i in DID_ATTRIBUTES:
        try:
            did[i] = did_metadata.get(i)
        except:
            did[i] = ''
    return did


def display_metadata(metadata_dict: dict):
    """
    Display Metadata Info
    """
    print("#" * 20)
    print("\n".join("{}:{}".format(i, j) for i, j in metadata_dict.items()))
    print("#" * 20)


if __name__ == "__main__":
    # Move to the project directory
    os.chdir('/usr/src/mypdftoolbox')
    # Specify a sample PDF file and its corresponding path
    test_pdf = 'Predict_Emotions_v1.pdf'
    test_pdf_path = os.path.join('./static', test_pdf)
    # Collect the Document Information Dictionary metadata
    did_dict = collect_did_metadata(input_file=test_pdf_path)
    # Display the Document Information Dictionary metadata
    print('List of Document Information Dictionary metadata...')
    display_metadata(did_dict)
    # Move the sample PDF to the output directory so it can be downloaded for validation
    download_path = os.path.join('/usercode/output', test_pdf)
    subprocess.call(["mv", test_pdf_path, download_path])

The following figure, captured in Adobe Acrobat Reader, shows the same DID metadata attributes for comparison with the collected values:

Scenario 2: Updating the DID metadata attributes

This scenario shows how to update the DID metadata attributes.

We will modify the DID metadata attributes of a sample PDF file called Predict_Emotions_v1.pdf and we will save the updated instance of this PDF to a new file called Predict_Emotions_v1_meta.pdf. Next, we will collect the DID metadata attributes from the resulting PDF document and compare their values to those extracted using Adobe Acrobat Reader.

Execute the following code snippet, and visualize the results:

Python 3.5
# Import libraries
from PyPDF4 import PdfFileReader, PdfFileMerger
import os, subprocess, time

# Exhaustive collection of Document Information Dictionary attributes
DID_ATTRIBUTES = ['/Title', '/Author', '/Subject', '/Keywords', '/Creator', '/Producer',
                  '/CreationDate', '/ModDate', '/Trapped']


def collect_did_metadata(input_file: str):
    """
    Collect Document Information Dictionary metadata
    """
    # Initialize a PdfFileReader object
    pdf_reader = PdfFileReader(input_file)
    # Create an object containing the Document Information metadata
    did_metadata = pdf_reader.getDocumentInfo()
    did = {}
    for i in DID_ATTRIBUTES:
        try:
            did[i] = did_metadata.get(i)
        except:
            did[i] = ''
    return did


def display_metadata(metadata_dict: dict):
    """
    Display Metadata Info
    """
    print("#" * 20)
    print("\n".join("{}:{}".format(i, j) for i, j in metadata_dict.items()))
    print("#" * 20)


def update_did_metadata(input_file: str,
                        output_file: str,
                        metadata_dict: dict):
    """
    Update the Document Information Dictionary metadata
    """
    pdf_merger = PdfFileMerger()
    pdf_merger.append(fileobj=open(input_file, 'rb'))
    pdf_merger.addMetadata({
        '/Author': metadata_dict.get('Author'),
        '/Subject': metadata_dict.get('Subject'),
        '/Title': metadata_dict.get('Title'),
        '/Keywords': metadata_dict.get('Keywords'),
        '/Producer': metadata_dict.get('Producer'),
        '/Creator': metadata_dict.get('Creator'),
        '/CreationDate': metadata_dict.get('CreationDate'),  # PDF date string, e.g. D:20210901120400Z
        '/ModDate': metadata_dict.get('ModDate'),  # PDF date string, e.g. D:20210901120400Z
    })
    pdf_out = open(output_file, 'wb')
    pdf_merger.write(pdf_out)
    pdf_out.close()
    pdf_merger.close()


if __name__ == "__main__":
    # Move to the project directory
    os.chdir('/usr/src/mypdftoolbox')
    # Specify the sample PDF file and its updated version
    test_pdf = 'Predict_Emotions_v1.pdf'
    modified_pdf = 'Predict_Emotions_v1_meta.pdf'
    metadata_dict = {
        'Author': 'Author Educative',
        'Subject': 'Subject Educative',
        'Title': 'Title Educative',
        'Keywords': 'Keywords Educative',
        'Producer': 'Producer Educative',
        'Creator': 'Creator Educative',
        'CreationDate': 'D:20210901120400Z',
        'ModDate': 'D:20210901120400Z',
    }
    # Update the Document Information Dictionary metadata with these attributes
    print("Updating the Document Information Dictionary metadata...")
    update_did_metadata(input_file=os.path.join('./static', test_pdf),
                        output_file=os.path.join('./static', modified_pdf),
                        metadata_dict=metadata_dict)
    # Collect the Document Information Dictionary metadata
    did_dict = collect_did_metadata(input_file=os.path.join('./static', modified_pdf))
    # Display the Document Information Dictionary metadata
    print("List of the updated Document Information Dictionary metadata")
    display_metadata(did_dict)
    # Wait a few seconds
    time.sleep(3)
    # Move the updated PDF to the output directory so it can be downloaded for validation
    download_path = os.path.join('/usercode/output', modified_pdf)
    subprocess.call(["mv", os.path.join('./static', modified_pdf), download_path])

The following figure, captured in Adobe Acrobat Reader, shows the updated DID attributes.

The modified items are highlighted in red boxes.
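
The CreationDate and ModDate values used in this scenario, such as D:20210901120400Z, follow the PDF date layout: the prefix D:, then the year, month, day, hour, minute, and second, with Z marking UTC. The sketch below converts between that form and Python datetime objects using only the standard library. It handles only the UTC form shown in this lesson, treats naive datetimes as already being in UTC, and the helper names to_pdf_date and from_pdf_date are ours.

```python
from datetime import datetime, timezone

PDF_UTC_FORMAT = "D:%Y%m%d%H%M%SZ"


def to_pdf_date(moment: datetime) -> str:
    """Format a datetime as a UTC PDF date string (naive values are assumed UTC)."""
    if moment.tzinfo is not None:
        moment = moment.astimezone(timezone.utc)
    return moment.strftime(PDF_UTC_FORMAT)


def from_pdf_date(value: str) -> datetime:
    """Parse a UTC PDF date string of the form D:YYYYMMDDHHmmSSZ."""
    return datetime.strptime(value, PDF_UTC_FORMAT).replace(tzinfo=timezone.utc)


stamp = to_pdf_date(datetime(2021, 9, 1, 12, 4, 0, tzinfo=timezone.utc))
print(stamp)                 # D:20210901120400Z
print(from_pdf_date(stamp))  # 2021-09-01 12:04:00+00:00
```

A value produced this way could be supplied as the CreationDate or ModDate entry of metadata_dict instead of a hand-typed string.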

Scenario 3: Managing the XMP metadata attributes

This scenario outlines multiple use cases related to the Extensible Metadata Platform (XMP) attributes and applies them to a sample file called PDF2.pdf.

Execute the following code snippet covering the following:

  • Collecting the XMP attributes.
  • Updating some XMP attributes.
  • Deleting some XMP attributes.

Please refer to the affected items highlighted in red boxes for additional details.

Python 3.5
# Import libraries
from PyPDF4 import PdfFileReader, PdfFileMerger
import pikepdf
import os, subprocess, time

# Exhaustive collection of Extensible Metadata Platform attributes
xmp_attributes = ['custom_properties', 'dc_contributor', 'dc_coverage', 'dc_creator', 'dc_date', 'dc_description',
                  'dc_format', 'dc_identifier', 'dc_language', 'dc_publisher', 'dc_relation', 'dc_rights',
                  'dc_source', 'dc_subject', 'dc_title', 'dc_type', 'pdf_keywords', 'pdf_pdfversion',
                  'pdf_producer', 'xmp_createDate', 'xmp_creatorTool', 'xmp_metadataDate', 'xmp_modifyDate',
                  'xmpmm_documentId', 'xmpmm_instanceId']


def collect_xmp_metadata(input_file: str):
    """
    Collect Extensible Metadata Platform metadata
    """
    # Initialize a PdfFileReader object
    pdf_reader = PdfFileReader(input_file)
    # Create an object containing the extractable Extensible Metadata Platform metadata
    xmp_metadata = pdf_reader.getXmpMetadata()
    xmp = {}
    for i in xmp_attributes:
        try:
            xmp[i] = getattr(xmp_metadata, i)
        except:
            xmp[i] = ''
    return xmp


def display_metadata(metadata_dict: dict):
    """
    Display Metadata Info
    """
    print("#" * 50)
    print("\n".join("{}:{}".format(i, j) for i, j in metadata_dict.items()))
    print("#" * 50)


def modify_metadata(input_file: str,
                    output_file: str,
                    metadata_dict: dict):
    """
    Change the metadata
    """
    with pikepdf.open(input_file) as pdf_in:
        with pdf_in.open_metadata(set_pikepdf_as_editor=False, update_docinfo=True) as meta:
            for key, val in metadata_dict.items():
                meta[key] = val
        pdf_in.save(output_file)


def delete_metadata(input_file: str,
                    output_file: str,
                    metadata_list: list):
    """
    Delete the list of metadata elements
    """
    with pikepdf.open(input_file) as pdf_in:
        with pdf_in.open_metadata(set_pikepdf_as_editor=False, update_docinfo=True) as meta:
            for i in metadata_list:
                del meta[i]
        pdf_in.save(output_file)


if __name__ == "__main__":
    # Move to the project directory
    os.chdir('/usr/src/mypdftoolbox')
    # Sample PDF version 2.0
    test_pdf = 'PDF2.pdf'
    upd_meta_pdf = 'PDF2_meta.pdf'
    del_meta_pdf = 'PDF2_meta_del.pdf'
    # Collect the Extensible Metadata Platform metadata attributes
    print("Collecting the preset XMP attributes...")
    xmp_dict = collect_xmp_metadata(input_file=os.path.join('./static', test_pdf))
    # Display the Extensible Metadata Platform metadata attributes
    display_metadata(xmp_dict)
    print("Updating the preset XMP attributes...")
    # Update the PDF metadata attributes
    metadata_dict = {
        'dc:title': 'Educative Title',
        'dc:creator': ['Educative Creator'],
        'pdf:Producer': 'Educative Producer',
        'xmp:CreatorTool': 'Creator Tool',
    }
    modify_metadata(input_file=os.path.join('./static', test_pdf),
                    output_file=os.path.join('./static', upd_meta_pdf),
                    metadata_dict=metadata_dict)
    # Collect the Extensible Metadata Platform metadata attributes
    print("Collecting the XMP attributes after update...")
    xmp_dict = collect_xmp_metadata(input_file=os.path.join('./static', upd_meta_pdf))
    # Display the Extensible Metadata Platform metadata attributes
    display_metadata(xmp_dict)
    # Delete specific metadata attributes
    metadata_list = ['dc:title', 'xmp:CreatorTool']
    delete_metadata(input_file=os.path.join('./static', upd_meta_pdf),
                    output_file=os.path.join('./static', del_meta_pdf),
                    metadata_list=metadata_list)
    print("Collecting the XMP attributes after deletion...")
    # Collect the Extensible Metadata Platform metadata attributes
    xmp_dict = collect_xmp_metadata(input_file=os.path.join('./static', del_meta_pdf))
    # Display the Extensible Metadata Platform metadata attributes
    display_metadata(xmp_dict)
    # Wait a few seconds
    time.sleep(3)
    # Move the resulting PDFs to the output directory so they can be downloaded for validation
    download_path = os.path.join('/usercode/output', upd_meta_pdf)
    subprocess.call(["mv", os.path.join('./static', upd_meta_pdf), download_path])
    download_path = os.path.join('/usercode/output', del_meta_pdf)
    subprocess.call(["mv", os.path.join('./static', del_meta_pdf), download_path])

The next figures, captured in Adobe Acrobat Reader, show the use cases outlined previously:

Now that you're armed with all this information about metadata manipulation, you can try changing these code snippets to develop your own scenarios!
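
When adapting Scenario 3, note that the two libraries name the same XMP properties differently: the PyPDF4 reader exposes attributes such as dc_title, while the Pikepdf functions take keys such as dc:title. The sketch below pairs only the four names that appear in both forms in this lesson. The pairing is our reading of those names, and PYPDF4_TO_PIKEPDF and to_pikepdf_keys are hypothetical helpers, not part of either library.

```python
# Pairs taken from Scenario 3: PyPDF4 attribute name -> Pikepdf key
PYPDF4_TO_PIKEPDF = {
    'dc_title': 'dc:title',
    'dc_creator': 'dc:creator',
    'pdf_producer': 'pdf:Producer',
    'xmp_creatorTool': 'xmp:CreatorTool',
}


def to_pikepdf_keys(collected: dict) -> dict:
    """Re-key a collected XMP dict, skipping unmapped or empty entries."""
    return {
        PYPDF4_TO_PIKEPDF[name]: value
        for name, value in collected.items()
        if name in PYPDF4_TO_PIKEPDF and value not in (None, '')
    }


collected = {'dc_title': 'Educative Title', 'dc_format': 'application/pdf', 'pdf_producer': ''}
print(to_pikepdf_keys(collected))
```

A dictionary re-keyed this way has the shape expected by the metadata_dict parameter of modify_metadata, and its keys have the shape expected by metadata_list in delete_metadata.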

Conclusion

We have walked through the routines needed to gather and manipulate the PDF metadata, while leveraging the capabilities of the PyPDF4 and Pikepdf Python libraries.