PyPDF2 Library for Working with PDF Files in Python
This article was published as a part of the Data Science Blogathon
Introduction
PDF stands for Portable Document Format, and PDF files use the .pdf extension. The format is mostly used for sharing documents: it is designed for reading rather than editing, so the formatting is preserved and a file looks the same on any device, independent of the hardware, software, and operating system. This makes it one of the most widely used document formats. PDF was invented by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO).
In this tutorial, we will learn how to work with PDF files in Python. The following topics will be covered:
- How to extract text from a PDF file.
- How to rotate pages of a PDF file.
- How to extract document information from a PDF file.
- How to split pages from a PDF file.
- How to merge pages of a PDF file.
- How to encrypt PDF files.
- How to add a watermark to a PDF file.
Some Common Libraries for PDFs in Python
There are many libraries available freely for working with PDFs:
1. PDFMiner: It is an open-source tool for extracting text from PDF. It is used for performing analysis on the data. It can also be used as a PDF transformer or PDF parser.
2. PDFQuery: It is a lightweight Python wrapper around PDFMiner, lxml, and PyQuery. It is a fast, user-friendly PDF scraping library.
3. tabula-py: It is a Python wrapper for tabula-java. It converts tables in PDF files into pandas DataFrames, on which all further data manipulation can be performed.
4. Xpdf: It allows conversion of PDFs into text.
5. pdflib: It is an extension of the poppler library with python bindings present in it.
6. Slate: It is a Python package based on PDFMiner that is used for extracting text from PDFs.
7. PyPDF2: It is a Python library for common tasks on PDF files, such as extracting document information, merging PDF files, splitting the pages of a PDF file, adding watermarks, and encrypting and decrypting PDF files. We will use the PyPDF2 library in this tutorial. Because it is written purely in Python, it runs on any platform without depending on external libraries.
Installing the PyPDF2 Library
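To install PyPDF2, run the following command in the command prompt. The examples in this tutorial use the original method names (PdfFileReader, getPage, and so on), which were removed in PyPDF2 3.0, so with a current pip an older release has to be pinned, for example with pip install "PyPDF2<3.0".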
pip install PyPDF2
Getting the document details
PyPDF2 provides access to the metadata of a PDF document. Information such as the author, title, producer, and subject of the document is available directly.
To extract the above information, run the following code:
from PyPDF2 import PdfFileReader

pdf_path = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfFileReader(f)
    # document metadata (author, creator, producer, ...)
    information = pdf.getDocumentInfo()
    number_of_pages = pdf.getNumPages()
print(information)
The output of the above code is as follows:

Let us format the output:
# format() also handles fields that are missing (None) in the PDF
print("Author: {}".format(information.author))
print("Creator: {}".format(information.creator))
print("Producer: {}".format(information.producer))
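Some PDFs leave fields such as the author empty, in which case the attribute is None. A minimal sketch of a more defensive formatter follows. The helper name `format_info` and the "(not set)" fallback text are assumptions made for illustration, not part of PyPDF2; the function accepts any object that exposes `author`, `creator`, and `producer` attributes, such as the object returned by `getDocumentInfo()`.

```python
def format_info(information, fields=("author", "creator", "producer")):
    """Return one 'Field: value' line per metadata field.

    `information` is any object with the named attributes (hypothetical
    helper, not a PyPDF2 function). Missing or None values are shown as
    '(not set)' instead of raising an error.
    """
    lines = []
    for field in fields:
        value = getattr(information, field, None)
        shown = value if value is not None else "(not set)"
        lines.append("{}: {}".format(field.capitalize(), shown))
    return "\n".join(lines)
```

With the `information` object from the earlier listing, this could be used as `print(format_info(information))`.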

Extracting Text from PDF
To extract text, we first open the PDF file in binary mode, which gives us a file object.
# creating a pdf file object
pdfFileObject = open(pdf_path, 'rb')
Then we will create a PdfFileReader object and pass the file object to it.
pdfReader = PdfFileReader(pdfFileObject)
Finally, we will go through the pages one by one and concatenate the text of each page.
text = ''
for i in range(pdfReader.numPages):
    # creating a page object
    pageObj = pdfReader.getPage(i)
    # extracting text from page
    text = text + pageObj.extractText()
print(text)
# closing the file object once all pages have been read
pdfFileObject.close()
The output text is as follows:

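The extraction loop can also be wrapped in a reusable function. The sketch below is one way to do it and is not part of PyPDF2: the name `extract_all_text` and the `separator` parameter are assumptions, and it relies only on the `getNumPages()`, `getPage()`, and `extractText()` calls already used in this tutorial. Collecting the pieces in a list and joining them once avoids repeated string concatenation.

```python
def extract_all_text(reader, separator="\n"):
    """Concatenate the text of every page of a PdfFileReader-like object.

    Hypothetical helper: `reader` only needs getNumPages() and getPage(i),
    and each page only needs extractText().
    """
    parts = []
    for i in range(reader.getNumPages()):
        page = reader.getPage(i)
        # pages without a text layer (for example scans) can yield ''
        parts.append(page.extractText())
    return separator.join(parts)
```

With the reader from the listing above, this could be called as `text = extract_all_text(pdfReader)` while the underlying file is still open.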
Rotating the pages of a PDF
To rotate the first page of a PDF file and save it to another file, copy the following code and run it.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_read = PdfFileReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfFileWriter()
# Rotate page 90 degrees to the right
page1 = pdf_read.getPage(0).rotateClockwise(90)
pdf_write.addPage(page1)
# Write the rotated page to a new file
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)
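Page rotation in PyPDF2 works in steps of 90 degrees, so it can help to validate an angle before passing it to `rotateClockwise`. The helper below is a hypothetical sketch (`normalize_rotation` is not a PyPDF2 function) that rejects other angles and reduces the rest to one of 0, 90, 180, or 270.

```python
def normalize_rotation(angle):
    """Return the equivalent clockwise angle as one of 0, 90, 180, 270.

    Hypothetical helper for illustration. Negative angles
    (counter-clockwise) are mapped to their clockwise equivalent.
    """
    if angle % 90 != 0:
        raise ValueError(
            "rotation must be a multiple of 90 degrees, got {}".format(angle)
        )
    return angle % 360
```

For example, `normalize_rotation(-90)` gives 270, which could then be passed to `rotateClockwise`.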

Merging PDF files in Python
We can also merge two or more PDF files using the following commands:
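from PyPDF2 import PdfFileReader, PdfFileWriter

# Files to merge, in order. The second input and the output name are
# placeholders; replace them with your own files.
paths = [r"C:\Users\Dell\Desktop\story.pdf",
         r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"]
pdf_write = PdfFileWriter()
for path in paths:
    pdf_read = PdfFileReader(path)
    # Copy every page of the current file into the output
    for page in range(pdf_read.getNumPages()):
        pdf_write.addPage(pdf_read.getPage(page))
with open(r'C:\Users\Dell\Desktop\merged.pdf', 'wb') as fh:
    pdf_write.write(fh)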
The output PDF is shown below:

Splitting the pages of PDF
We can split a PDF into separate pages and save them again as PDFs.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf = PdfFileReader(pdf_path)
# base name of the input file without directory or extension
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(pdf.getNumPages()):
    # write each page to its own single-page PDF
    pdfwrite = PdfFileWriter()
    pdfwrite.addPage(pdf.getPage(page))
    outputfilename = '{}_page_{}.pdf'.format(fname, page + 1)
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))
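The output files are named after the input file with a 1-based page number appended, and since only the base name of the input path is kept, they end up in the current working directory. The naming step can be sketched on its own as follows; `split_output_names` is a hypothetical helper introduced here for illustration.

```python
import os


def split_output_names(pdf_path, number_of_pages):
    """Return the output file name for each page, numbered from 1.

    Hypothetical helper: it mirrors the naming pattern
    '<name>_page_<n>.pdf' used when splitting a PDF page by page.
    """
    fname = os.path.splitext(os.path.basename(pdf_path))[0]
    return [
        '{}_page_{}.pdf'.format(fname, page + 1)
        for page in range(number_of_pages)
    ]
```

Computing the names up front makes it easy to check for existing files before anything is overwritten.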
Encrypting a PDF file
Encrypting a PDF file means adding a password to it. Each time the file is opened, the viewer prompts for the password, so the content is password protected. The following popup comes up:

We can use the following code for the same:
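from PyPDF2 import PdfFileReader, PdfFileWriter

# pdf_path is the input file used earlier. The output name and the
# password below are placeholders; choose your own values.
outputpdf = "encrypted.pdf"
password = "my-password"

pdf = PdfFileReader(pdf_path)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfwrite.addPage(pdf.getPage(page))
# Set the password once, after all pages have been added
pdfwrite.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)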
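To read an encrypted file back, the reader has to be given the password first. The sketch below assumes the reader interface of the PyPDF2 releases used in this tutorial, where `isEncrypted` reports whether a password is needed and `decrypt()` returns 0 when the password does not match. The function name `unlock` is an assumption made for illustration.

```python
def unlock(reader, password):
    """Decrypt a PdfFileReader-like object in place if it is encrypted.

    Hypothetical helper, not a PyPDF2 function. Returns True if a
    password was needed and accepted, False if the document was not
    encrypted, and raises ValueError if the password is rejected.
    """
    if not reader.isEncrypted:
        return False
    # decrypt() returns 0 when the password matches neither the user
    # nor the owner password
    if reader.decrypt(password) == 0:
        raise ValueError("incorrect password")
    return True
```

After `unlock(reader, password)` succeeds, pages can be read with `getPage()` as in the earlier sections.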
Adding a Watermark to the PDF file
A watermark is an identifying image or pattern that appears on each page. It can be a company logo or any other mark or text that should be visible on every page.
To add a watermark to each page of the PDF, copy the following code and run it.
from PyPDF2 import PdfFileReader, PdfFileWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermarkfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"

# The first page of the watermark PDF is used as the watermark
watermark = PdfFileReader(watermarkfile)
watermarkpage = watermark.getPage(0)

pdf = PdfFileReader(originalfile)
pdfwrite = PdfFileWriter()
for page in range(pdf.getNumPages()):
    pdfpage = pdf.getPage(page)
    # Overlay the watermark on the current page
    pdfpage.mergePage(watermarkpage)
    pdfwrite.addPage(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)
The above code reads two files: the input file and the watermark. It then merges the watermark page onto each page of the input file and saves the result as a new file in the same location.

End Notes
As we have seen above, common operations on PDF files, such as extracting text and metadata, rotating, merging, splitting, encrypting, and watermarking, can be performed in Python using the PyPDF2 library. It is written purely in Python and is therefore platform-independent. It is easy to use and flexible.
It never goes without saying:
Thanks for reading!
Image Source
- Image 1: https://monkeypen.com/pages/free-childrens-books