Introduction

A PDF document is rich in metadata artifacts, which can provide valuable information during a digital forensic investigation. There are multiple ways to extract this metadata from a PDF file, but the common techniques are either manual or do not cover all of the metadata artifacts.

What is metadata?

Simply put, metadata is data about data. The metadata of digital objects is generally divided into two categories:

  • The file system metadata

The file system metadata refers to data elements that are related to the hosting file itself and do not participate in the byte-sequence that constitutes the file’s binary structure.

  • The application metadata

The application metadata deals with the elements that are intrinsic to the file and participate in the binary’s byte-sequence.
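
To make the distinction concrete, the following minimal Python sketch reads a few file system attributes with the standard library's os.stat and shows that changing a timestamp leaves the file's bytes untouched. The temporary file and its placeholder content are assumptions for illustration, not a real PDF.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_system_metadata(path):
    """Return a few attributes kept by the file system, not by the file's bytes."""
    stats = os.stat(path)
    return {
        "size_bytes": stats.st_size,
        "modified_utc": datetime.fromtimestamp(stats.st_mtime, tz=timezone.utc),
    }


def read_bytes(path):
    """Read the complete byte sequence of the file."""
    with open(path, "rb") as handle:
        return handle.read()


# Hypothetical sample: a placeholder file standing in for a real PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"%PDF-1.4\n")
    sample_path = handle.name

bytes_before = read_bytes(sample_path)
# Rewrite the access and modification times (seconds since the epoch).
os.utime(sample_path, (1_000_000_000, 1_000_000_000))
bytes_after = read_bytes(sample_path)

metadata = file_system_metadata(sample_path)
os.remove(sample_path)

print(metadata)
print("bytes unchanged:", bytes_before == bytes_after)
```

The modification time changes while the byte sequence stays identical, which is why file system metadata has to be collected separately from anything stored inside the PDF itself.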

In this course, we focus on manipulating the application metadata.

Application metadata types

The application metadata is stored within a PDF file as either a document information dictionary object or a metadata stream object.

The document information dictionary (DID) has been part of the PDF format since version 1.0. It holds general information about a PDF file as pairs of data objects, each consisting of a key and a matching value. Metadata streams, available since PDF 1.4 (2001), are a more elaborate mechanism for embedding more comprehensive metadata attributes in a PDF document. The contents of a metadata stream are represented in Extensible Markup Language (XML) and may describe the entire PDF as well as specific components within it.

  • The document information dictionary

There are nine attributes associated with the DID objects, listed below:

Key Name | Data Type | Value Description
/Title | Text | Title of the Document
/Author | Text | Author of the Document
/Subject | Text | Subject of the Document
/Keywords | Text | Keywords Linked to the Document
/Creator | Text | Application That Created the Original Document
/Producer | Text | Application That Converted the Document to PDF
/CreationDate | Date | Date and Time of Document’s Creation
/ModDate | Date | Date and Time of Document’s Last Modification
/Trapped | Name | Indicates If the Document Has Been Modified to Include Trapping Information

The attributes of the DID follow the PDF name syntax, so each key is written as “/Key Name” (for example, /Author). The dictionary itself is typically referenced by the /Info entry of the file trailer.
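
As a rough illustration of this key syntax, the sketch below pulls key and value pairs out of a hand-written information dictionary and converts a PDF date string (D:YYYYMMDDHHmmSS followed by an optional time zone offset) into a Python datetime. The sample bytes, the names in them, and the helper names parse_info_dictionary and parse_pdf_date are hypothetical. The regular expression ignores nested parentheses, escape sequences, hexadecimal strings, and compressed objects, so a real investigation would rely on a full PDF parser.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical raw bytes of an information dictionary, as it might appear in
# an uncompressed PDF. All values are illustrative.
RAW_INFO = rb"""
5 0 obj
<< /Title (Quarterly Report)
   /Author (Jane Doe)
   /Producer (ExampleWriter 1.0)
   /CreationDate (D:20230115103000+01'00')
   /Trapped /False >>
endobj
"""

# Matches "/Key (literal string)" or "/Key /Name" only (simplifying assumption).
ENTRY = re.compile(rb"/(\w+)\s*(?:\(([^()]*)\)|/(\w+))")

# Date fields after the year are optional; the offset is written as +HH'mm'.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_info_dictionary(raw):
    """Collect simple key/value pairs from raw dictionary bytes."""
    entries = {}
    for match in ENTRY.finditer(raw):
        key = "/" + match.group(1).decode("ascii")
        if match.group(2) is not None:  # literal string value
            entries[key] = match.group(2).decode("latin-1")
        else:  # name object value, kept with its leading slash
            entries[key] = "/" + match.group(3).decode("ascii")
    return entries


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (naive if no offset is given)."""
    match = PDF_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz)


info = parse_info_dictionary(RAW_INFO)
created = parse_pdf_date(info["/CreationDate"])
print(info["/Author"], created.isoformat())
```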

  • The metadata streams

Parsing the metadata streams returns an object represented in the Extensible Metadata Platform (XMP), an XML-based metadata format. The resulting object includes twenty-five possible metadata values, which are given in the following table:

Key Name | Type | Schema | Description
custom_properties | Dictionary | Custom schema properties | Custom Metadata Properties
dc_contributor | List | Dublin Core (dc) | Non-Authorial Contributors to the Document
dc_coverage | List | Dublin Core (dc) | Describes the Scope or Extent of the Document
dc_creator | List | Dublin Core (dc) | Names of Document’s Authors
dc_date | List | Dublin Core (dc) | Datetime Object of Significance to the Document
dc_description | Dictionary | Dublin Core (dc) | Descriptions of the Document’s Contents
dc_format | String | Dublin Core (dc) | Document’s MIME-Type
dc_identifier | String | Dublin Core (dc) | Document’s Unique Identifier
dc_language | List | Dublin Core (dc) | Languages Used in the Document
dc_publisher | List | Dublin Core (dc) | Publisher of the Document
dc_relation | List | Dublin Core (dc) | Relationships to Other Documents
dc_rights | Dictionary | Dublin Core (dc) | User’s Rights to the Document
dc_source | String | Dublin Core (dc) | Unique Identifier of the Document’s Source
dc_subject | List | Dublin Core (dc) | Keywords Indicating Document’s Subject
dc_title | Dictionary | Dublin Core (dc) | Document’s Title
dc_type | List | Dublin Core (dc) | Description of Document’s Type
pdf_keywords | String | Adobe PDF Schema | Additional Listing of Document’s Keywords
pdf_pdfversion | String | Adobe PDF Schema | PDF’s Version
pdf_producer | String | Adobe PDF Schema | Tool that Created PDF Document
xmp_createDate | String | XMP Basic Schema | Date the Document was Created
xmp_creatorTool | String | XMP Basic Schema | First Tool Used to Create the Document’s Source
xmp_metadataDate | Datetime Object | XMP Basic Schema | Most Recent Change Date of Metadata
xmp_modifyDate | Datetime Object | XMP Basic Schema | Most Recent Change Date of Document
xmpmm_documentId | String | XMP Media Management Schema | Common Identifier for All Versions of the Document
xmpmm_instanceId | String | XMP Media Management Schema | Unique Identifier for this Particular Document

The XMP metadata attributes are grouped into schemas. Each schema is identified by a unique namespace URI and holds an arbitrary number of properties. The most widely used predefined XMP schema is the “Dublin Core” (“dc”), which includes general attributes such as “dc:contributor”, “dc:coverage”, and “dc:creator”.
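
The role of the namespace URIs can be seen by reading a small XMP packet with Python's standard xml.etree.ElementTree module. This is a sketch under stated assumptions: the packet is hand-written for illustration, the helper read_property is hypothetical, and extracting and decompressing the metadata stream from a real PDF is assumed to have happened already.

```python
import xml.etree.ElementTree as ET

# Namespace URIs that identify the schemas used in the sample packet.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Hypothetical XMP packet; all values are illustrative.
SAMPLE_XMP = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li></rdf:Seq></dc:creator>
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Quarterly Report</rdf:li></rdf:Alt>
      </dc:title>
      <xmp:CreatorTool>ExampleWriter 1.0</xmp:CreatorTool>
      <pdf:Producer>ExampleWriter 1.0</pdf:Producer>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
""".strip()


def read_property(root, qualified_name):
    """Look up a property such as 'dc:creator' by its schema prefix and name.

    Returns a list for container values (rdf:Seq, rdf:Bag, rdf:Alt), a string
    for simple values, and None when the property is absent.
    """
    prefix, local_name = qualified_name.split(":")
    element = root.find(f".//{{{NAMESPACES[prefix]}}}{local_name}")
    if element is None:
        return None
    items = [li.text for li in element.iter(f"{{{NAMESPACES['rdf']}}}li")]
    return items if items else element.text


root = ET.fromstring(SAMPLE_XMP)
for name in ("dc:creator", "dc:title", "xmp:CreatorTool", "pdf:Producer"):
    print(name, "=", read_property(root, name))
```

Properties are matched by namespace URI rather than by prefix, so a packet that binds Dublin Core to a prefix other than “dc” would still be read correctly.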

The correlation between both metadata types

The DID was deprecated in PDF 2.0 in favor of XMP metadata streams. In the meantime, tools that maintain both records are expected to keep them in sync, so the standard DID attributes are updated when the corresponding XMP attributes are updated.

The following table shows the correlation between the DID and XMP attributes:

DID attribute | XMP attribute
/Title | dc:title
/Author | dc:creator
/Subject | dc:description
/Keywords | pdf:Keywords
/Creator | xmp:CreatorTool
/Producer | pdf:Producer
/CreationDate | xmp:CreateDate
/ModDate | xmp:ModifyDate
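
The table above can be expressed as a lookup that flags disagreements between the two metadata records, which may indicate that a file was edited by a tool that updated only one of them. This is a minimal sketch: it assumes both records have already been extracted and normalized into plain dictionaries of strings, and the function name find_mismatches and the sample values are hypothetical.

```python
# Correlation between DID keys and XMP properties, taken from the table above.
DID_TO_XMP = {
    "/Title": "dc:title",
    "/Author": "dc:creator",
    "/Subject": "dc:description",
    "/Keywords": "pdf:Keywords",
    "/Creator": "xmp:CreatorTool",
    "/Producer": "pdf:Producer",
    "/CreationDate": "xmp:CreateDate",
    "/ModDate": "xmp:ModifyDate",
}


def find_mismatches(did, xmp):
    """Return {DID key: (DID value, XMP value)} for attributes that disagree.

    Attributes missing from either record are skipped rather than reported.
    """
    mismatches = {}
    for did_key, xmp_key in DID_TO_XMP.items():
        if did_key in did and xmp_key in xmp and did[did_key] != xmp[xmp_key]:
            mismatches[did_key] = (did[did_key], xmp[xmp_key])
    return mismatches


# Hypothetical, already-normalized records from the same document.
did_record = {"/Title": "Quarterly Report", "/Author": "Jane Doe"}
xmp_record = {"dc:title": "Quarterly Report", "dc:creator": "John Roe"}

mismatches = find_mismatches(did_record, xmp_record)
print(mismatches)
```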

Conclusion

A great deal of relevant information can be gleaned from PDF metadata, which makes it especially important for forensic analysis and investigations.

In the next lesson, we will learn to manipulate PDF metadata.
