Megha — September 2, 2021
Beginner Project Python

This article was published as a part of the Data Science Blogathon

Introduction

PDF stands for Portable Document Format and uses the .pdf extension. The format is mostly used for sharing documents: a PDF is meant for reading rather than editing, so its formatting stays intact, and it looks the same on any device regardless of hardware, software, or operating system. This makes it one of the most widely used document formats. PDF was invented by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO).

In this tutorial, we will learn how to work with PDF files in Python. The following topics will be covered:

  • How to extract text from a PDF file.
  • How to rotate pages of a PDF file.
  • How to extract document information from a PDF file.
  • How to split pages from a PDF file.
  • How to merge pages of a PDF file.
  • How to encrypt PDF files.
  • How to add a watermark to a PDF file.

Some Common Libraries for PDFs in Python

There are many libraries available freely for working with PDFs:

1. PDFMiner: It is an open-source tool for extracting text from PDFs so that the data can be analysed. It can also be used as a PDF transformer or PDF parser.

2. PDFQuery: It is a lightweight Python wrapper around PDFMiner, lxml, and PyQuery. It is a fast, user-friendly PDF scraping library.

3. tabula-py: It is a Python wrapper for tabula-java. It converts tables in PDF files into Pandas data frames, on which all further data manipulation can be performed.

4. Xpdf: It allows conversion of PDFs into text.

5. pdflib: It is a Poppler-based extension with Python bindings.

6. Slate: It is a Python package based on PDFMiner and used for extracting text from PDFs.

7. PyPDF2: It is a Python library for performing major tasks on PDF files, such as extracting document information, merging PDF files, splitting the pages of a PDF file, adding watermarks, and encrypting and decrypting PDF files. We will use the PyPDF2 library in this tutorial. It is a pure-Python library, so it runs on any platform where Python runs, without depending on external libraries.

Installing the PyPDF2 Library

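To install PyPDF2, run the following command in the command prompt. The code in this tutorial uses the original PyPDF2 names (PdfFileReader, getPage, and so on), which were removed in PyPDF2 3.0, so install an earlier version:

pip install "PyPDF2<3.0"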

 

Getting the document details

PyPDF2 can read the metadata stored in a PDF document. Information such as the author, title, producer, and subject of the document is available directly.


To extract the above information, run the following code:

from PyPDF2 import PdfFileReader

pdf_path = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # document metadata (author, creator, producer, ...)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()
    print(information)

The output of the above code is as follows:

[Image: output of the code above]

Let us format the output:

# format() also works when a field is missing and comes back as None
print("Author: {}".format(information.author))
print("Creator: {}".format(information.creator))
print("Producer: {}".format(information.producer))
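
Fields that are not set in the PDF come back as None, so it can help to substitute a placeholder when printing. The following is a minimal sketch and not part of PyPDF2: format_info and the "(not set)" placeholder are hypothetical names chosen for illustration, and the function only assumes an object that exposes the metadata fields as attributes, as the information object above does.

```python
def format_info(information,
                fields=("author", "creator", "producer", "title", "subject")):
    """Return one 'Field: value' line per metadata field.

    `information` is assumed to expose the fields as attributes, as the
    object returned by getDocumentInfo() does. Missing or None values
    are shown as '(not set)'.
    """
    lines = []
    for field in fields:
        value = getattr(information, field, None)
        if value is None:
            value = "(not set)"
        lines.append("{}: {}".format(field.capitalize(), value))
    return "\n".join(lines)


# Usage with the object read earlier:
#     print(format_info(information))
```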

Extracting Text from PDF

To extract text, we first open the file in binary mode to get a file object.

import PyPDF2

# creating a pdf file object
pdfFileObject = open(pdf_path, 'rb')

Then we create a PdfFileReader object and pass the file object to it.

# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

Finally, we extract the text of each page and concatenate it.

text = ''
for i in range(pdfReader.numPages):
    # creating a page object
    pageObj = pdfReader.getPage(i)
    # extracting text from page
    text = text + pageObj.extractText()
print(text)

# closing the pdf file object
pdfFileObject.close()

The output text is as follows:

[Image: text extracted from the PDF]
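
The extraction loop can also be wrapped in a small function that collects the page texts in a list and joins them once at the end, which keeps a separator between pages. This is only a sketch: extract_all_text and its separator parameter are hypothetical names, and the function assumes a reader that behaves like the PdfFileReader object used above.

```python
def extract_all_text(reader, separator="\n"):
    """Collect the text of every page and join it once at the end.

    `reader` is assumed to behave like the PdfFileReader object used
    above: it has a numPages attribute and a getPage(i) method whose
    result offers extractText().
    """
    page_texts = []
    for i in range(reader.numPages):
        page_texts.append(reader.getPage(i).extractText())
    return separator.join(page_texts)


# Usage with the reader created earlier:
#     text = extract_all_text(pdfReader)
```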

Rotating the pages of a PDF 

To rotate a page of a PDF file and save it to another file, run the following code.

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_read = PdfFileReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfFileWriter()
# Rotate the first page 90 degrees to the right
page1 = pdf_read.getPage(0).rotateClockwise(90)
pdf_write.addPage(page1)
with open(r"C:\Users\Dell\Desktop\rotate_pages.pdf", 'wb') as fh:
    pdf_write.write(fh)

Image 1
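
The code above rotates only the first page. To rotate every page, the same calls can be placed in a loop. The sketch below assumes reader and writer objects that behave like PdfFileReader and PdfFileWriter; rotate_all_pages is a hypothetical helper name. PyPDF2 only accepts rotation angles that are multiples of 90 degrees, so the helper checks the angle first.

```python
def rotate_all_pages(reader, writer, angle=90):
    """Add every page of `reader` to `writer`, rotated clockwise.

    `reader` and `writer` are assumed to behave like the PdfFileReader
    and PdfFileWriter objects used above. Angles that are not a
    multiple of 90 degrees are rejected.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    for i in range(reader.getNumPages()):
        writer.addPage(reader.getPage(i).rotateClockwise(angle))
    return writer


# Usage with the objects created earlier:
#     rotate_all_pages(pdf_read, pdf_write, 90)
#     then write pdf_write to a file as shown above
```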

 

Merging PDF files in Python

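We can also merge two or more PDF files by adding the pages of each file to a single writer:

from PyPDF2 import PdfFileReader, PdfFileWriter

# Files to merge, in order
pdf_paths = [r"C:\Users\Dell\Desktop\story.pdf",
             r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"]
pdf_write = PdfFileWriter()
for path in pdf_paths:
    pdf_read = PdfFileReader(path)
    # Append every page of the current file to the output
    for page in range(pdf_read.getNumPages()):
        pdf_write.addPage(pdf_read.getPage(page))
# merged.pdf is an example output name
with open(r"C:\Users\Dell\Desktop\merged.pdf", 'wb') as fh:
    pdf_write.write(fh)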

The output PDF is shown below:

 

[Image: merged PDF]

Splitting the pages of PDF

We can split a PDF into separate pages and save them again as PDFs.

import os
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(pdf_path)
# Base name of the input file without its extension
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(pdf.getNumPages()):
    # Write each page to its own one-page PDF
    pdfwrite = PdfFileWriter()
    pdfwrite.addPage(pdf.getPage(page))
    outputfilename = '{}_page_{}.pdf'.format(fname, page + 1)
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))
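
Splitting does not have to produce one file per page. As a sketch under that assumption, the hypothetical helper chunk_page_ranges below computes which page indices would go into each output file when a PDF is split into chunks of a fixed size. It is plain Python and does not call PyPDF2 itself.

```python
def chunk_page_ranges(num_pages, chunk_size):
    """Return (start, stop) page-index pairs covering num_pages pages.

    Indices are zero-based and stop is exclusive, matching range().
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


# Usage with the reader from the splitting code:
#     for start, stop in chunk_page_ranges(pdf.getNumPages(), 5):
#         pdfwrite = PdfFileWriter()
#         for page in range(start, stop):
#             pdfwrite.addPage(pdf.getPage(page))
#         # then write pdfwrite to a file named after start and stop
```

With 10 pages and a chunk size of 4, the helper returns three ranges: pages 0 to 3, 4 to 7, and 8 to 9.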

 

Encrypting a PDF file

Encrypting a PDF file means adding a password to it. Each time the file is opened, the viewer prompts for the password, so the content is password protected. The following popup comes up:

[Image: password prompt shown when opening the encrypted PDF]

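We can encrypt a file with the following code:

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(pdf_path)
pdfwrite = PdfFileWriter()
outputpdf = 'encrypted.pdf'  # example output name
password = 'password'        # example value; choose your own
for page in range(pdf.getNumPages()):
    pdfwrite.addPage(pdf.getPage(page))
# The user password is required to open the file
pdfwrite.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)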
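
To read an encrypted file back in Python, the reader has to be decrypted before its pages are accessed. The sketch below assumes a reader that behaves like PdfFileReader, whose isEncrypted attribute tells whether a password is needed and whose decrypt() method returns 0 when the password does not match. The helper name unlock_reader is hypothetical.

```python
def unlock_reader(reader, password):
    """Decrypt `reader` with `password` if it is encrypted.

    `reader` is assumed to behave like PdfFileReader. A ValueError is
    raised when the password does not match.
    """
    if not reader.isEncrypted:
        return reader
    if reader.decrypt(password) == 0:
        raise ValueError("wrong password")
    return reader


# Usage with the encrypted file written earlier:
#     pdf = unlock_reader(PdfFileReader(outputpdf), password)
#     print(pdf.getNumPages())
```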

Adding a Watermark to the PDF file

A watermark is an identifying image or pattern that appears on each page. It can be a company logo or any other information that should appear on every page.
To add a watermark to each page of the PDF, run the following code.

from PyPDF2 import PdfFileReader, PdfFileWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermarkfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
# The watermark is the first page of the watermark PDF
watermark = PdfFileReader(watermarkfile)
watermarkpage = watermark.getPage(0)
pdf = PdfFileReader(originalfile)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfpage = pdf.getPage(page)
    # Draw the watermark page on top of the current page
    pdfpage.mergePage(watermarkpage)
    pdfwrite.addPage(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)

The above code reads two files: the input file and the watermark. It then merges the watermark onto each page of the input file and saves the result as a new file in the same folder.

 

[Image: watermarked PDF page]

End Notes

As we have seen above, common PDF operations such as extracting text and metadata, rotating, merging, splitting, encrypting, and watermarking can be performed in Python using the PyPDF2 library. It is written purely in Python, so it is platform-independent. It is easy to use and provides great flexibility.

It never goes without saying:

Thanks for reading!

Image Source

  1. Image 1: https://monkeypen.com/pages/free-childrens-books

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
