Manipulating Scanned PDF Files

Introduction

PDF documents are created in two main ways. They are either generated directly by an electronic source, in which case they are known as native PDFs, or produced by scanning paper documents, in which case they are known as scanned PDFs.

Native PDF documents contain an internal structure that can be read and interpreted, whereas scanned PDFs consist of page images, meaning that their content cannot be searched or edited directly.

Performing OCR on a scanned PDF

Optical Character Recognition (OCR) is a technology that turns printed or handwritten text into an electronic, character-based file using a visual recognition process.

For instance, to convert a scanned PDF to an editable format such as a text or MS Word document, OCR software is needed to analyze the “image” of each character that has been scanned in and match it to the corresponding electronic character.
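
To make the idea of matching a character "image" to an electronic character more concrete, here is a deliberately tiny sketch in pure Python. It is not how production OCR engines work. It only illustrates the matching step with naive template matching, and the 3x5 bitmaps, the `TEMPLATES` table, and the function names are all hypothetical choices made for this illustration.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a hypothetical 3x5 bitmap: '#' is ink, '.' is blank paper.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def pixel_distance(a, b):
    """Count the pixels at which two same-sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize_glyph(bitmap):
    """Return the template character with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda ch: pixel_distance(bitmap, TEMPLATES[ch]))


def recognize_line(glyphs):
    """Recognize a sequence of glyph bitmaps and join them into a string."""
    return "".join(recognize_glyph(g) for g in glyphs)


# A "scanned" word in which the middle glyph has one missing pixel.
scanned = [
    ("#..", "#..", "#..", "#..", "###"),  # clean L
    ("###", "#.#", "#.#", "#.#", "##."),  # noisy O (bottom-right pixel lost)
    ("###", ".#.", ".#.", ".#.", ".#."),  # clean T
]

print(recognize_line(scanned))  # prints "LOT"
```

Because the noisy glyph still differs from the "O" template by only one pixel, it is matched correctly. A real scan adds skew, varying fonts, and touching characters, which is why dedicated OCR software is needed in practice.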

Scope

Whether you need to extract information from scanned PDF contracts, invoices, or purchase orders, this lesson will help you develop a PDF OCR tool using the Python programming language.
