Introduction to Metadata
Get acquainted with PDF metadata and its subtleties.
Introduction
A PDF document is intrinsically rich in metadata artifacts, which can be valuable during a digital forensic investigation. While there are multiple ways to extract this metadata from a PDF file, most of these techniques are either manual or cover only part of the metadata artifacts.
What is metadata?
Simply put, metadata is data about data. The metadata of digital objects is generally divided into two categories:
- The file system metadata: data elements that relate to the hosting file itself (such as its name, size, and timestamps) and are not part of the byte sequence that makes up the file’s binary structure.
- The application metadata: elements that are intrinsic to the file and are part of its byte sequence.
For the sake of this course, we will emphasize the manipulation of the application metadata.
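To make the distinction concrete, here is a minimal sketch that reads a few file system metadata values with Python's standard library. The helper name `file_system_metadata` and the throwaway demo file are assumptions made for illustration. None of the values it returns are stored inside the file's own bytes, which is what separates them from application metadata.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_system_metadata(path):
    """Return a few attributes that the file system keeps about a file."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        "permissions": oct(st.st_mode & 0o777),
    }


# Demo on a throwaway file that only contains a PDF header line (9 bytes).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
    demo_path = tmp.name

info = file_system_metadata(demo_path)
os.remove(demo_path)

for key, value in info.items():
    print(f"{key}: {value}")
```

Copying or re-saving the file can change these values without touching a single byte of its content, which is one reason investigators treat the two categories separately.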
Application metadata types
The application metadata is stored within a PDF file as either a document information dictionary object or a metadata stream object.
The document information dictionary (DID) has been part of the PDF format since version 1.0. It holds general information about a PDF file as pairs of data objects, each consisting of a key and a matching value. Metadata streams, available since PDF 1.4 (2001), are a more elaborate mechanism for embedding comprehensive metadata attributes in a PDF document. The contents of a metadata stream are represented in Extensible Markup Language (XML) and may describe the entire PDF as well as specific components within it.
- The document information dictionary
The DID defines nine standard attributes, listed below:
Key Name | Data Type | Value Description |
---|---|---|
/Title | Text | Title of the Document |
/Author | Text | Author of the Document |
/Subject | Text | Subject of the Document |
/Keywords | Text | Keywords Linked to the Document |
/Creator | Text | Application That Created the Original Document (Before Conversion to PDF) |
/Producer | Text | Application That Converted the Document to PDF |
/CreationDate | Date | Date and Time of Document’s Creation |
/ModDate | Date | Date and Time of Document’s Last Modification |
/Trapped | Name Object | Indicates If the Document Has Been Modified to Include Trapping Information |
Each DID key is a PDF name object, written with a leading slash (“/Key Name”). The DID itself is usually stored as an indirect object that the file trailer points to through its /Info entry.
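The key and value syntax can be illustrated with a small, standard-library-only sketch. It assumes an uncompressed PDF fragment in which the trailer references the DID through `/Info` and every value is a simple literal string. The sample bytes and the helper names `find_info_dictionary` and `parse_pdf_date` are hypothetical. Real files may use hex strings, UTF-16 text, nested parentheses, or object streams, none of which this sketch handles.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical fragment: one DID object plus a trailer that references it.
SAMPLE = (
    b"1 0 obj\n"
    b"<< /Title (Quarterly Report) /Author (Jane Doe) "
    b"/Producer (Example PDF Library) "
    b"/CreationDate (D:20230115093000+01'00') >>\n"
    b"endobj\n"
    b"trailer\n<< /Size 2 /Info 1 0 R >>\n"
)


def find_info_dictionary(pdf_bytes):
    """Follow the trailer's /Info reference and return the DID as a dict."""
    ref = re.search(rb"/Info\s+(\d+)\s+(\d+)\s+R", pdf_bytes)
    if ref is None:
        return {}
    obj_num, gen_num = ref.group(1), ref.group(2)
    body = re.search(
        rb"(?<!\d)" + obj_num + rb"\s+" + gen_num + rb"\s+obj\s*<<(.*?)>>\s*endobj",
        pdf_bytes,
        re.DOTALL,
    )
    if body is None:
        return {}
    # Collect "/Key (literal string)" pairs; escaped characters are kept as-is.
    pairs = re.findall(rb"/(\w+)\s*\(((?:\\.|[^\\()])*)\)", body.group(1))
    return {"/" + k.decode("ascii"): v.decode("latin-1") for k, v in pairs}


def parse_pdf_date(value):
    """Convert a PDF date string (D:YYYYMMDDHHmmSS plus optional offset)."""
    m = re.match(
        r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
        r"(?:([+\-Z])(\d{2})?'?(\d{2})?'?)?",
        value,
    )
    if m is None:
        raise ValueError(f"not a PDF date: {value!r}")
    parts = m.groups()
    year = int(parts[0])
    month, day = int(parts[1] or 1), int(parts[2] or 1)
    hour, minute, second = (int(p or 0) for p in parts[3:6])
    sign = parts[6]
    if sign is None:
        tz = None  # no offset recorded, so the result stays naive
    else:
        offset = timedelta(hours=int(parts[7] or 0), minutes=int(parts[8] or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


did = find_info_dictionary(SAMPLE)
created = parse_pdf_date(did["/CreationDate"])
print(did["/Author"], created.isoformat())
```

For anything beyond a quick look at a well-behaved file, a dedicated PDF library is a better fit than regular expressions, since it also resolves compressed objects and string encodings.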
- The metadata streams
A metadata stream holds an XML packet that follows the Extensible Metadata Platform (XMP), a metadata format built on RDF/XML. Parsing the stream returns an object that exposes twenty-five possible metadata values, given in the following table:
Key Name | Type | Schema | Description |
---|---|---|---|
custom_properties | Dictionary | Custom schema properties | Custom Metadata Properties |
dc_contributor | List | Dublin Core (dc) | Non-Authorial Contributors to the Document |
dc_coverage | List | Dublin Core (dc) | Describes the Scope or Extent of the Document |
dc_creator | List | Dublin Core (dc) | Names of Document’s Authors |
dc_date | List | Dublin Core (dc) | Datetime Object of Significance to the Document |
dc_description | Dictionary | Dublin Core (dc) | Descriptions of the Document’s Contents |
dc_format | String | Dublin Core (dc) | Document’s MIME-Type |
dc_identifier | String | Dublin Core (dc) | Document’s Unique Identifier |
dc_language | List | Dublin Core (dc) | Languages Used in the Document |
dc_publisher | List | Dublin Core (dc) | Publisher of the Document |
dc_relation | List | Dublin Core (dc) | Relationships to Other Documents |
dc_rights | Dictionary | Dublin Core (dc) | User’s Rights to the Document |
dc_source | String | Dublin Core (dc) | Unique Identifier of the Document’s Source |
dc_subject | List | Dublin Core (dc) | Keywords Indicating Document’s Subject |
dc_title | Dictionary | Dublin Core (dc) | Document’s Title |
dc_type | List | Dublin Core (dc) | Description of Document’s Type |
pdf_keywords | String | Adobe PDF Schema | Additional Listing of Document’s Keywords |
pdf_pdfversion | String | Adobe PDF Schema | PDF’s Version |
pdf_producer | String | Adobe PDF Schema | Tool that Created PDF Document |
xmp_createDate | String | XMP Basic Schema | Date the Document was Created |
xmp_creatorTool | String | XMP Basic Schema | First Tool Used to Create the Document’s Source |
xmp_metadataDate | Datetime | XMP Basic Schema | Most Recent Change Date of Metadata |
xmp_modifyDate | Datetime | XMP Basic Schema | Most Recent Change Date of Document |
xmpmm_documentId | String | XMP Media Management Schema | Common Identifier for All Versions of the Document |
xmpmm_instanceId | String | XMP Media Management Schema | Unique Identifier for this Particular Document |
The XMP metadata attributes are grouped in schemas. Each schema is identified by a unique namespace URI and holds an arbitrary number of properties. The most widely used predefined XMP schema is the “Dublin Core” (“dc”). It includes general attributes such as “dc:contributor”, “dc:coverage”, and “dc:creator”.
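The role of namespaces can be sketched with Python's built-in `xml.etree.ElementTree`. The packet below is a hypothetical, hand-written example, and `read_xmp` and `prefixed` are illustrative helper names. The sketch only reads properties written as child elements of `rdf:Description`. Properties written as XML attributes, which XMP also allows, are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the schemas used in this sketch.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}
URI_TO_PREFIX = {uri: prefix for prefix, uri in NS.items()}

# Hypothetical XMP packet, as it might appear inside a metadata stream.
XMP_PACKET = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:creator><rdf:Seq><rdf:li>Jane Doe</rdf:li><rdf:li>John Roe</rdf:li></rdf:Seq></dc:creator>
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Quarterly Report</rdf:li></rdf:Alt></dc:title>
   <xmp:CreatorTool>Example Word Processor</xmp:CreatorTool>
   <pdf:Producer>Example PDF Library</pdf:Producer>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def prefixed(tag):
    """Turn ElementTree's '{uri}local' tag into 'prefix:local'."""
    uri, _, local = tag[1:].partition("}")
    return f"{URI_TO_PREFIX.get(uri, uri)}:{local}"


def read_xmp(packet):
    """Return the properties of every rdf:Description as a flat dict."""
    root = ET.fromstring(packet)
    result = {}
    for description in root.iterfind(".//rdf:Description", NS):
        for prop in description:
            items = prop.findall(".//rdf:li", NS)
            # Array-valued properties (Seq, Bag, Alt) become lists of strings.
            result[prefixed(prop.tag)] = [li.text for li in items] if items else prop.text
    return result


xmp = read_xmp(XMP_PACKET)
for name, value in xmp.items():
    print(name, "=", value)
```

Because each schema is keyed by its namespace URI rather than by its prefix, the lookup table maps URIs back to the conventional prefixes. A packet that declares the Dublin Core namespace under a different prefix would still be read correctly.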
The correlation between both metadata types
The DID was deprecated in PDF 2.0 in favor of XMP metadata streams. On a related note, tools that maintain both stores typically update the standard DID attributes whenever the corresponding XMP metadata attributes are updated.
The following table shows the correlation between the DID and XMP attributes:
DID attribute | XMP attribute |
---|---|
/Title | dc:title |
/Author | dc:creator |
/Subject | dc:description |
/Keywords | pdf:Keywords |
/Creator | xmp:CreatorTool |
/Producer | pdf:Producer |
/CreationDate | xmp:CreateDate |
/ModDate | xmp:ModifyDate |
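The table above can be turned into a simple consistency check. The following is a minimal sketch that assumes both metadata stores have already been extracted into plain dictionaries. The function name `find_mismatches` and the sample values are hypothetical. Date attributes would need to be normalized first, because the DID and XMP use different date formats, so the sample leaves them out.

```python
# Correlation between DID keys and XMP properties, as listed in the table.
DID_TO_XMP = {
    "/Title": "dc:title",
    "/Author": "dc:creator",
    "/Subject": "dc:description",
    "/Keywords": "pdf:Keywords",
    "/Creator": "xmp:CreatorTool",
    "/Producer": "pdf:Producer",
    "/CreationDate": "xmp:CreateDate",
    "/ModDate": "xmp:ModifyDate",
}


def find_mismatches(did, xmp):
    """Return the DID keys whose value disagrees with the XMP counterpart."""
    mismatches = {}
    for did_key, xmp_key in DID_TO_XMP.items():
        if did_key not in did or xmp_key not in xmp:
            continue  # only compare attributes present in both stores
        xmp_value = xmp[xmp_key]
        # XMP array properties arrive as lists; accept a match on any item.
        candidates = xmp_value if isinstance(xmp_value, list) else [xmp_value]
        if did[did_key] not in candidates:
            mismatches[did_key] = (did[did_key], xmp_value)
    return mismatches


# Hypothetical values extracted from one document.
did = {
    "/Title": "Quarterly Report",
    "/Author": "Jane Doe",
    "/Producer": "Example PDF Library",
}
xmp = {
    "dc:title": ["Quarterly Report"],
    "dc:creator": ["Jane Doe"],
    "pdf:Producer": "Another PDF Tool",
}

print(find_mismatches(did, xmp))
```

A mismatch does not prove tampering on its own, but it may indicate that the file was edited by a tool that updated only one of the two metadata stores.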
Conclusion
PDF metadata can yield a great deal of relevant information, which is especially valuable for forensic analysis and investigations.
In the next lesson, we will learn to manipulate PDF metadata.