A Walkthrough of the Top Python Libraries for PDF Processing

Abstract

PDFs are natively designed to play a dual role:

  • Faithfully presenting human-readable documents that are well-styled for printing.
  • Carrying data of disparate types.

Generally, PDFs are composed of semi-structured or unstructured data, which makes it difficult to process these files or pull information out of them.

The problem statement

PDF documents present a fundamental challenge for automated manipulation because the data they store is heterogeneous: it has no predefined data model and is not organized in a predefined manner.

It is worth noting that unstructured data, such as links, buttons, form fields, audio, and video, lacks an identifiable structure or architecture. As a result, running searches, deleting portions, or applying updates becomes cumbersome.

On the other hand, structured data, such as relational data, has elements that can be addressed directly for effective analysis, while semi-structured data has some organizational properties that make it simpler to analyze.
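To make the difficulty concrete, the sketch below searches a hand-written fragment of a PDF page content stream. The fragment is an illustrative sample, not taken from a real file, and the helper names naive_search and extract_text are hypothetical. It shows that a phrase visible on the page need not be stored contiguously, because the TJ text-showing operator takes an array of string fragments interleaved with spacing adjustments. The regular expression is a deliberate simplification.

```python
import re

# Hand-written sample of a page content stream (illustrative only).
# TJ shows an array of string fragments separated by spacing adjustments,
# so the phrase seen on the page is never stored as one contiguous string.
CONTENT_STREAM = b"BT /F1 12 Tf 72 720 Td [(Hel) -20 (lo W) 15 (orld)] TJ ET"


def naive_search(stream: bytes, phrase: bytes) -> bool:
    """Search the raw bytes the way one would search a plain-text file."""
    return phrase in stream


def extract_text(stream: bytes) -> str:
    """Join the literal string fragments found in the stream.

    Simplification: ignores escaped parentheses, hex strings, and fonts
    with custom encodings, all of which occur in real PDFs.
    """
    fragments = re.findall(rb"\(([^()]*)\)", stream)
    return b"".join(fragments).decode("latin-1")


print(naive_search(CONTENT_STREAM, b"Hello World"))  # False
print(extract_text(CONTENT_STREAM))                  # Hello World
```

Real documents add compression, font encodings, and arbitrary drawing order on top of this, which is why dedicated libraries are preferable to ad hoc parsing.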

Python comes to the rescue

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. The factors listed below make Python a very attractive language for Rapid Application Development:

  • High-level built-in data structures.
  • Dynamic typing.
  • Dynamic binding.

Moreover, these factors make Python well suited for use as a scripting language or as glue to link existing components together. Python’s simple, easy-to-learn syntax emphasizes readability, and consequently reduces the cost of program maintenance. Python supports modules and packages, which encourage program modularity and code reuse.

Programmers are often attracted to Python because of the increased productivity it provides.

Python is the best bet for PDF processing

Python is frequently described as a batteries-included language, and it offers well-integrated libraries for handling unstructured data sources like PDF.

PDF processing comes under the umbrella of text analytics. Python offers many useful text analytics libraries and frameworks, which make it a strong and affordable choice for PDF manipulation.

When it comes to PDF management, there are many aspects to consider before making a decision. Although there are plenty of desktop PDF editors, the top-ranked ones are expensive.

At the time of writing this course, the most popular PDF editors, as per this ranking, were priced as follows:

PDF Editor             | License Price
Foxit PDF Editor       | $199 Perpetual
pdfFiller by airSlate  | $8 per user/month
PDFelement             | $3180 Perpetual - 20 users
Acrobat Standard DC    | $155 per user/year
Acrobat Pro DC         | $175 per user/year
Acrobat Pro Cloud Plan | $14.99 per user/month

Online PDF editors like IlovePDF do not guarantee the privacy and security of the processed documents, and pose high data confidentiality and integrity risks.

Various Python libraries for PDF processing

Python libraries and packages are collections of reusable modules and functions that reduce the amount of code we have to write ourselves, and they play a vital role in simplifying our programming experience.

Although countless Python libraries deal with PDF processing, we will go through the most useful and handy ones, which will constitute the foundation of the utilities that we aim to develop throughout this course.

While exploring these Python libraries, we will shed light on their statistics, which have been collected from their publicly available GitHub repositories. These statistics take the following metrics into consideration:

  • Stars: These indicate the level of appreciation of the project.
  • Forks: These show how many copies of the project repository have been made, typically to experiment with or contribute enhancements.
  • Releases: These indicate, to some extent, the level of contribution to the project.
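For readers who want to refresh these figures themselves, here is a minimal sketch that reads the star and fork counts from the GitHub REST API repository endpoint. It assumes the response fields stargazers_count and forks_count; the helper names summarize_repo and fetch_repo and the sample payload are illustrative only. Counting releases would need a separate, paginated call and is left out.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com/repos"


def summarize_repo(payload: dict) -> dict:
    """Pick the popularity metrics out of a repository payload."""
    return {
        "stars": payload["stargazers_count"],
        "forks": payload["forks_count"],
    }


def fetch_repo(owner: str, repo: str) -> dict:
    """Download repository metadata.

    Needs network access; unauthenticated requests are rate-limited.
    """
    request = Request(
        f"{API_ROOT}/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request) as response:
        return json.load(response)


# Offline demonstration with a made-up payload.
sample = {"stargazers_count": 1100, "forks_count": 204, "name": "example"}
print(summarize_repo(sample))  # {'stars': 1100, 'forks': 204}
```

Calling summarize_repo(fetch_repo(owner, repo)) with a real owner and repository name would return live numbers, which will differ from the snapshot values quoted in this chapter.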

We will embark on this journey with the general-purpose libraries and then look at the more specialized ones:

PyPDF4

PyPDF4 is a pure-Python library for PDF processing, built on top of PyPDF2 and capable of:

  • Extracting PDF information (title, author, …).
  • Splitting and merging documents page by page.
  • Cropping pages.
  • Combining multiple pages into a single page.
  • Encrypting and decrypting a PDF file.

Because it is a pure-Python library, it can run on any Python platform without any external dependencies. Moreover, it allows PDF manipulation in memory by working with in-memory stream objects (StringIO in the library's original documentation; io.BytesIO in Python 3) instead of file streams. This makes it particularly useful for websites that manage or manipulate PDFs.
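A minimal sketch of this in-memory workflow follows. It assumes PyPDF4 is installed and exposes the PyPDF2 1.x class names PdfFileWriter and PdfFileReader; the helper names build_blank_pdf and count_pages are illustrative and not part of the library. No file is touched on disk: the PDF is written to, and read back from, an io.BytesIO buffer.

```python
import io


def build_blank_pdf(page_count: int) -> bytes:
    """Create a PDF made of blank A4 pages entirely in memory."""
    # Imported lazily so this module loads even where PyPDF4 is absent.
    from PyPDF4 import PdfFileWriter

    writer = PdfFileWriter()
    for _ in range(page_count):
        writer.addBlankPage(width=595, height=842)  # A4 size in points
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


def count_pages(pdf_bytes: bytes) -> int:
    """Count the pages of a PDF held in a bytes object."""
    from PyPDF4 import PdfFileReader

    reader = PdfFileReader(io.BytesIO(pdf_bytes))
    return reader.getNumPages()
```

In a web application, the bytes returned by build_blank_pdf could be sent directly as the body of an HTTP response, which is the scenario the in-memory design targets.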

PyPDF4 in summary:

# Stars | # Forks | # Releases | Latest Release | Latest Release Date (DD/MM/YYYY) | Languages
224     | 970     | 11         | 1.27.0         | 07/08/2018                       | Python: 100%

PyPDF2 in summary:

# Stars | # Forks | # Releases | Latest Release | Latest Release Date (DD/MM/YYYY) | Languages
3800    | 970     | 10         | 1.26.0         | 18/05/2016                       | Python: 99.8% / Shell: 0.2%

ReportLab

ReportLab is a robust open-source engine for creating complex, data-driven PDF documents.

ReportLab comes in two versions: the open-source ReportLab toolkit and the commercial ReportLab PLUS.

The open-source toolkit is free and written in Python. This ubiquitous package sees 50,000+ downloads per month, and was chosen to power the print and export features of Wikipedia.

To respond to real-world reporting needs, mainly those of large institutions, the ReportLab Toolkit has evolved throughout the years. The library has three major layers:

  • A page layout engine that constructs documents from elements such as paragraphs, fonts, tables, headlines, and vector graphics.
  • A charts and widgets library for building data graphics.
  • A graphics canvas API that draws PDF pages.
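The lowest of these layers, the canvas API, can be sketched as follows, assuming the open-source reportlab package is installed. The function name draw_hello_pdf and the coordinates are illustrative choices. Coordinates are in points, with the origin at the bottom-left corner of the page.

```python
import io


def draw_hello_pdf(text: str = "Hello, ReportLab") -> bytes:
    """Draw one line of text and a rule on an A4 page, in memory."""
    # Imported lazily so this module loads even where reportlab is absent.
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    buffer = io.BytesIO()
    pdf = canvas.Canvas(buffer, pagesize=A4)
    width, height = A4
    pdf.setFont("Helvetica", 14)
    pdf.drawString(72, height - 72, text)               # one inch from top-left
    pdf.line(72, height - 80, width - 72, height - 80)  # rule under the text
    pdf.showPage()  # finish the current page
    pdf.save()      # write the PDF into the buffer
    return buffer.getvalue()
```

Passing a file name instead of the buffer to canvas.Canvas would write the document to disk.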

The commercial product ReportLab PLUS is derived from the open-source ReportLab Toolkit. It is capable of generating PDF documents at a higher speed, and allows the use of the XML-based template language RML. ReportLab PLUS offers substantial enhancements over the open-source toolkit.

ReportLab in summary:

# Releases | Latest Release | Latest Release Date (DD/MM/YYYY)
62         | 3.6.1          | 06/08/2021

Most prominently, Wikipedia uses ReportLab to generate its PDF exports.

ReportLab comes with a higher-level layout engine called PLATYPUS (Page Layout and Typography Using Scripts), which enables the creation of dynamic layouts based on templates at the document and page level.
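A minimal PLATYPUS sketch, assuming the open-source reportlab package is installed: the document is described as a list of flowable elements (commonly called a story), and the document template decides where page breaks fall. The function name build_report is illustrative; the styles come from ReportLab's sample style sheet.

```python
import io


def build_report(title: str, paragraphs: list) -> bytes:
    """Lay out a title and body paragraphs with PLATYPUS, in memory."""
    # Imported lazily so this module loads even where reportlab is absent.
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

    styles = getSampleStyleSheet()
    story = [Paragraph(title, styles["Title"]), Spacer(1, 12)]
    for text in paragraphs:
        story.append(Paragraph(text, styles["BodyText"]))
        story.append(Spacer(1, 6))  # small vertical gap between paragraphs

    buffer = io.BytesIO()
    SimpleDocTemplate(buffer, pagesize=A4).build(story)
    return buffer.getvalue()
```

Unlike the canvas API, no coordinates appear here: text wraps and flows onto new pages automatically as the story grows.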

ReportLab PLUS’s distinctive features are:

  • A specific language for building templates.
  • The ability to incorporate vector graphics.

It is worth noting that the Report Markup Language (RML) is an XML dialect used for building templates.

PyMuPDF

PyMuPDF is a wrapper for the MuPDF library, a lightweight viewer for PDF, XPS, and e-book formats.

MuPDF is distinguished by its performance and superior rendering quality, and it is supported by Artifex Software, Inc.

MuPDF can open files in various formats, including PDF, XPS, OpenXPS, CBZ, EPUB, and FB2 (e-books).

PyMuPDF offers a plethora of features for dealing with PDF documents, which include:

  • Accessing the PDF document metadata, links, and bookmarks.
  • Rendering the document pages in raster formats, like PNG, or vector formats, like SVG.
  • Extracting text and images and searching for text.
  • Converting the document pages to other formats.
  • Rearranging a document to support double-sided printing, or embedding logos and watermarks.
  • Decrypting a PDF document.
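A few of the features listed above can be sketched as follows, assuming a reasonably recent PyMuPDF release that provides the snake_case method names and is importable as fitz. The function name demo_pymupdf is illustrative. To stay self-contained, the sketch creates a one-page document in memory instead of opening an existing file.

```python
def demo_pymupdf() -> dict:
    """Create a page, then extract, search, and render it."""
    # Imported lazily so this module loads even where PyMuPDF is absent.
    import fitz  # import name of PyMuPDF

    doc = fitz.open()      # new, empty PDF held in memory
    page = doc.new_page()  # default page size is A4
    page.insert_text((72, 72), "Hello from PyMuPDF")

    result = {
        "metadata": doc.metadata,                       # document metadata
        "text": page.get_text(),                        # plain-text extraction
        "hits": len(page.search_for("PyMuPDF")),        # number of matches
        "png": page.get_pixmap().tobytes("png"),        # raster rendering
    }
    doc.close()
    return result
```

For an existing file, fitz.open("some.pdf") would replace the first two steps; the file name here is a placeholder.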

PyMuPDF in summary:

# Stars | # Forks | # Releases | Latest Release | Latest Release Date (DD/MM/YYYY) | Languages
1100    | 204     | 91         | 1.18.17        | 24/08/2018                       | Sphinx

Pdf2docx

This library extracts data (that is, text, images, and drawings) from a PDF document using the PyMuPDF library. It then parses the layout and uses the python-docx library to build a .docx document.
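The conversion described above can be sketched with the library's Converter class, assuming pdf2docx is installed. The wrapper name pdf_to_docx and its parameters are illustrative; start and end are zero-based page indexes, and end=None means "through the last page".

```python
def pdf_to_docx(pdf_path: str, docx_path: str, start: int = 0, end=None) -> None:
    """Convert the selected page range of a PDF into a .docx file."""
    # Imported lazily so this module loads even where pdf2docx is absent.
    from pdf2docx import Converter

    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path, start=start, end=end)
    finally:
        converter.close()  # always release the underlying PDF handle
```

A call such as pdf_to_docx("input.pdf", "output.docx") would convert the whole document; both file names are placeholders.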

Pdf2docx in summary:

# Stars | # Forks | # Releases | Latest Release | Latest Release Date (DD/MM/YYYY) | Languages
196     | 54      | 19         | 0.5.2          | 30/05/2021                       | Python: 99.8% / Makefile: 0.2%

PDFNetPython3

PDFNetPython3 is a wrapper for the PDFTron SDK.

PDFTron is not freeware, and offers two types of licenses, depending on whether you’re developing an in-house solution, or an external or commercial product.

PDFTron SDK is an exhaustive PDF toolkit that allows us to build credible applications for viewing, creating, printing, editing, and annotating PDFs across numerous operating systems.

Developers use the PDFTron SDK to read, write, and edit PDF documents that are compliant with almost all PDF versions. This comprehensive PDF library supports most use cases, such as printing, stamping, and editing, among others.

PDFNetPython3 in summary:

# Releases | Latest Release | Latest Release Date (DD/MM/YYYY)
6          | 9.1.0          | 27/08/2021

Borb

Borb is a pure-Python library designed to read, edit, write, and manipulate PDF files. It represents a PDF document as a JSON-like data structure.

This library offers an extensive set of functions, which include, but are not limited to, the following:

  • Read a PDF document.
  • Extract and change PDF meta-information.
  • Extract text and images from a PDF.
  • Change images in a PDF.
  • Annotate a PDF.
  • Add text, tables, and lists to a PDF.

Borb in summary:

# Stars | # Forks | # Releases | Latest Release | Latest Release Date (DD/MM/YYYY) | Languages
827     | 32      | 22         | 2.0.9          | 30/08/2021                       | Python: 99.0% / Other: 1.0%

This non-exhaustive list of libraries is dynamic and may vary depending on future releases of the cited libraries, or on new arrivals within this category.

Below, we’ll see a table showing some of the PDF processing functions, elaborated throughout this course, and the corresponding Python libraries we’ve relied upon to develop these features:

Feature PyPDF4 PyMuPDF ReportLab PDFNetPython3 Borb Pdf2docx
Metadata collection X
DID Metadata editing X
Creating PDF X
Adding Comments to PDF X
Splitting PDF pages X
Rotating PDF pages X
Removing PDF pages X
Shuffling PDF pages X
Dynamically watermarking PDF pages X X
Converting PDF pages Into Images X
Compressing PDF X
Digitally signing PDF X
Converting PDF to MS Word X

Conclusion

These libraries are extremely valuable when manipulating PDF documents, because they save time and provide explicit functions that one can build upon.

Many Python libraries provide functions to manipulate PDF files. We have short-listed some of them here for further study and for later use across a wide range of PDF processing activities.

Now that we have familiarized ourselves with some general concepts in this introductory chapter, it’s time to tackle real-life scenarios related to PDF management. The rest of this course is structured to reflect the categorization of functions associated with PDF management activities, as per the table below:

Category            | Functions covered
Core                | Metadata treatment, create and add comments.
Pages processing    | Split, rotate, remove, shuffle, dynamically watermark, and convert into images.
Content processing  | Extract tabular data, images, and hyperlinks, annotate text, redact text, and parse text data.
Document processing | Merge multiple PDFs, convert PDF to other file types, compress, secure, crack, digitally sign, manipulate scanned PDF, compute the checksum, and pinpoint the difference between two PDFs.