PyPDF2 Library for Working with PDF Files in Python

Siddharth Last Updated : 20 Nov, 2024


PDF stands for Portable Document Format and has a .pdf file extension. Users predominantly utilize this format for document sharing because it preserves the original formatting, ensuring that documents appear consistent across various platforms, regardless of the hardware, software, or operating system used. This consistency makes PDFs the format of choice for distributing, viewing, and ensuring the integrity of documents on a global scale.

Originally developed by Adobe, PDF has transcended its proprietary origins to become an open standard, governed by the International Organization for Standardization (ISO). This transition to an ISO standard has further cemented PDF’s position as a cornerstone of digital document management, facilitating its adoption in a wide range of applications from academic publishing to business communications.

In this tutorial, we will learn how to work with PDF files in Python. The following topics will be covered:

How to extract text from a PDF file.
How to rotate pages of a PDF file.
How to extract document information from a PDF file.
How to split pages from a PDF file.
How to encrypt PDF files.
How to add a watermark to a PDF file.

This article was published as a part of the Data Science Blogathon.

Some Common Libraries for PDFs in Python
Getting Started with the PyPDF2 Library
Key Features
Use Cases of PyPDF2
Installing the PyPDF2 Library
Getting the Document Details
Extracting Text from PDF
Rotating the pages of a PDF
Merging PDF files in Python
Splitting the Pages of PDF
Encrypting a PDF File
Adding a Watermark to the PDF File
Conclusion
Frequently Asked Questions

Some Common Libraries for PDFs in Python

There are many libraries available freely for working with PDFs:

PDFMiner: It is an open-source tool for extracting text from PDFs so that the data can be analyzed. It can also be used as a PDF transformer or PDF parser.
PDFQuery: It is a lightweight Python wrapper around PDFMiner, lxml, and PyQuery. It is a fast, user-friendly PDF scraping library.
tabula-py: It is a Python wrapper for tabula-java. It reads tables from PDF files into pandas DataFrames, allowing you to perform all the usual data manipulation operations on them.
Xpdf: It allows conversion of PDFs into text.
pdflib: It is an extension of the poppler library that provides Python bindings.
Slate: It is a Python package based on PDFMiner and is used for extracting text from PDFs.
PyPDF2: This Python library performs major tasks on PDF files, such as extracting document-specific information, merging PDF files, splitting pages of a PDF file, adding watermarks to a file, and encrypting or decrypting PDF files. We will use the PyPDF2 library in this tutorial. It is a pure Python library, so it runs on any platform where Python runs and does not depend on external libraries.

Getting Started with the PyPDF2 Library

PyPDF2 is a comprehensive Python library designed for the manipulation of PDF files. It enables users to create, modify, and extract content from PDF documents. Built entirely in Python, PyPDF2 does not rely on any external modules, making it an accessible tool for Python developers.

The library is organized around a small set of classes. A reader class (PdfReader, called PdfFileReader in older releases) opens an existing PDF and gives access to its pages and metadata. A writer class (PdfWriter, formerly PdfFileWriter) collects pages, optionally transformed or encrypted, and writes them to a new file. Most tasks in this tutorial follow the same pattern: read pages from one or more documents, modify them, and write the result.

Key Features

Extraction of text and metadata from PDF documents.
Creation of new PDF files assembled from the pages of existing ones.
Modification of existing PDFs through the addition, deletion, or reordering of pages.
Page-level editing such as rotation, cropping, and merging one page onto another (for example, to add a watermark).
Password-based encryption and decryption of documents.

PyPDF2 is written entirely in Python and contains no compiled extension code. This trades some speed for portability: parsing very large files can be slower than with C-backed libraries, but the package installs anywhere Python runs and keeps a simple, Pythonic interface.

Use Cases of PyPDF2

PyPDF2’s flexibility makes it a good choice for integrating PDF processing into your workflow or Python projects. Below are some practical applications:

PDF Conversion to Word or Other Formats

Traditionally, converting PDFs into Word or other file formats requires specialized software for each conversion type, which can be inefficient, especially when handling multiple documents. PyPDF2 does not perform these conversions itself. What it offers is the extraction step: a script can pull the text out of many PDFs in a batch and hand it to another library that writes the target format.

Merging Multiple PDF Documents

Whether you’re compiling reports, combining chapters of a book, or consolidating financial statements, PyPDF2 simplifies the process of merging multiple PDF files into a single document. This capability is invaluable for creating cohesive documents from disparate sources, enhancing organization and accessibility.

Modifying PDF Document Contents

PyPDF2’s functionality extends beyond basic file manipulation, allowing for modifications within PDF documents. Users can add or remove pages, rotate pages, extract text for analysis, and overlay the content of one PDF page onto another. This level of control makes PyPDF2 a versatile tool for tailoring documents to specific requirements.

Splitting PDFs into Smaller Segments

Large PDF documents can be unwieldy, making them difficult to share or process. PyPDF2 addresses this by letting a script copy any selection of pages into a new file, so a single large document can be divided by page number, at regular intervals (every n pages), or by any other rule you can express in Python.

Enhancements and Additional Use Cases

Extracting and Analyzing PDF Content: PyPDF2 can be employed for text mining and analysis, extracting text from PDFs for use in data analysis projects, natural language processing, or content aggregation.
Automating Report Assembly: Combine PDFs produced by other tools, such as charts or tables rendered by a reporting library, into a single report.
Securing PDFs: Protect sensitive information by encrypting PDFs with a password.
Custom PDF Creation: Assemble customized PDFs programmatically by selecting, reordering, and combining pages from existing documents.

In summary, PyPDF2 caters to a wide array of PDF-related tasks, from text extraction and merging to splitting, rotating, and encrypting documents. These capabilities make it a useful resource for professionals, developers, and hobbyists seeking to streamline their document management processes.


Installing the PyPDF2 Library

To install PyPDF2, run the following command in the command prompt:

pip install PyPDF2
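Class and method names differ between PyPDF2 releases (for example, PdfFileReader was removed in version 3.0 in favor of PdfReader), so it can help to confirm which version pip actually installed. The following is a minimal sketch that uses only the standard library; the helper name installed_version is an illustrative choice, not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed in the current environment
version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    print(f"PyPDF2 {version} is installed")
```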

Getting the Document Details

PyPDF2 exposes the metadata stored in a PDF document. Fields such as the author, title, subject, creator, and producer are available directly as attributes.

To extract the above information, run the following code:

# PdfReader is the PyPDF2 3.x name; PdfFileReader was removed in 3.0
from PyPDF2 import PdfReader

pdf_path = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
with open(pdf_path, 'rb') as f:
    pdf = PdfReader(f)
    # document information (author, title, producer, ...)
    information = pdf.metadata
    number_of_pages = len(pdf.pages)
    print(information)

The code above prints the document information, a dictionary-like object whose keys are PDF metadata fields such as /Author and /Producer.

Let us format the output:

# f-strings avoid a TypeError when a field is missing (None)
print(f"Author: {information.author}")
print(f"Creator: {information.creator}")
print(f"Producer: {information.producer}")
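Not every PDF sets every metadata field, and an unset field comes back as None. As a rough sketch, the formatting step can be generalized into a helper that keeps only the fields that are present. The function name metadata_summary and its default field list are assumptions made for illustration; the attribute names (title, author, subject, creator, producer) are the ones PyPDF2 exposes on the document information object.

```python
def metadata_summary(info, fields=("title", "author", "subject", "creator", "producer")):
    """Return {label: value} for the metadata fields that are actually set."""
    summary = {}
    for field in fields:
        # A missing attribute and an unset field (None) are both skipped
        value = getattr(info, field, None)
        if value:
            summary[field.capitalize()] = str(value)
    return summary


# Usage with the `information` object from the example above:
# for label, value in metadata_summary(information).items():
#     print(f"{label}: {value}")
```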

Extracting Text from PDF

Text extraction is one of PyPDF2’s weaker areas. The extracted text often loses its original layout and comes back cluttered with line-break characters, so expect to clean it up before using it.

To extract text, we first open the file in binary mode to get a file object.

# creating a pdf file object
pdfFileObject = open(pdf_path, 'rb')

Then we create a PdfReader object and pass the file object to it.

# creating a pdf reader object
import PyPDF2
pdfReader = PyPDF2.PdfReader(pdfFileObject)

Finally, we extract the text of each page and concatenate the results.

text = ''
for pageObj in pdfReader.pages:
    # extracting text from the page ('' if nothing is returned)
    text = text + (pageObj.extract_text() or '')
print(text)
# close the file once all pages have been read
pdfFileObject.close()

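The extracted text is printed to the console. Expect stray line breaks wherever the PDF stores a line of text as a separate fragment.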
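Because extracted text tends to be cluttered with line breaks, a small cleanup pass is often useful before analysis. The sketch below uses only the standard library; the function name tidy_extracted_text and its rules (re-join words hyphenated across a line break, treat blank lines as paragraph breaks, collapse other whitespace) are assumptions chosen for illustration and will not suit every document. In particular, the hyphen rule is a heuristic and can join words that were meant to keep their hyphen.

```python
import re


def tidy_extracted_text(text):
    """Collapse stray line breaks while keeping blank-line paragraph breaks."""
    # Re-join words split by a hyphen at the end of a line
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # One or more blank lines mark a paragraph boundary
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, turn any run of whitespace into a single space
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)


# Usage with the `text` variable built in the extraction loop:
# print(tidy_extracted_text(text))
```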

Rotating the pages of a PDF

To rotate a page of a PDF file and save it to another file, copy the following code and run it.

from PyPDF2 import PdfReader, PdfWriter

pdf_read = PdfReader(r"C:\Users\Dell\Desktop\story.pdf")
pdf_write = PdfWriter()
# Rotate the first page 90 degrees to the right (clockwise)
page1 = pdf_read.pages[0].rotate(90)
pdf_write.add_page(page1)
with open(r'C:\Users\Dell\Desktop\rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)
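The same idea extends to rotating every page of a document. PDF page rotation only accepts multiples of 90 degrees, so the sketch below validates the angle first. The helper names normalize_rotation and rotate_all_pages are hypothetical; the functions rely only on the reader's pages sequence, the page's rotate method, and the writer's add_page method as found in recent PyPDF2 releases.

```python
def normalize_rotation(angle):
    """Map an angle to 0, 90, 180 or 270; reject non-multiples of 90."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return angle % 360


def rotate_all_pages(reader, writer, angle):
    """Add every page of `reader` to `writer`, rotated clockwise by `angle`."""
    quarter_turns = normalize_rotation(angle)
    for page in reader.pages:
        writer.add_page(page.rotate(quarter_turns))
    return writer


# Usage (assumes PyPDF2 3.x and the reader/writer from the example above):
# rotate_all_pages(pdf_read, PdfWriter(), -90)  # -90 is treated as 270 clockwise
```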

Merging PDF files in Python

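We can also merge two or more PDF files by adding the pages of each file, in order, to a single writer:

from PyPDF2 import PdfReader, PdfWriter

# Files to merge, in order; merged.pdf below is a placeholder output name
pdf_paths = [r"C:\Users\Dell\Desktop\story.pdf",
             r"C:\Users\Dell\Desktop\rotate_pages.pdf"]
pdf_write = PdfWriter()
for path in pdf_paths:
    # Append every page of the current file to the writer
    for page in PdfReader(path).pages:
        pdf_write.add_page(page)
with open(r"C:\Users\Dell\Desktop\merged.pdf", 'wb') as fh:
    pdf_write.write(fh)

The output PDF contains the pages of each input file in the order listed.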

Splitting the Pages of PDF

We can split a PDF into separate pages and save each page as its own PDF.

import os
from PyPDF2 import PdfReader, PdfWriter

pdf = PdfReader(pdf_path)
# base name of the input file, without directory or extension
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(len(pdf.pages)):
    # a fresh writer per page, so each output file holds exactly one page
    pdfwrite = PdfWriter()
    pdfwrite.add_page(pdf.pages[page])
    outputfilename = '{}_page_{}.pdf'.format(fname, page + 1)
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))
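Splitting at regular intervals (every n pages) follows the same pattern. The sketch below separates the page arithmetic from the file handling so the ranges can be checked on their own. The names chunk_ranges, split_every_n_pages, and output_prefix are illustrative assumptions, and the PyPDF2 calls assume the 3.x naming (PdfReader, PdfWriter, add_page).

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) page-index pairs, with stop exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


def split_every_n_pages(pdf_path, chunk_size, output_prefix):
    """Write one PDF per chunk and return the list of file names created."""
    # Imported here so chunk_ranges can be used without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    created = []
    for start, stop in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. prefix_pages_1-4.pdf
        name = f"{output_prefix}_pages_{start + 1}-{stop}.pdf"
        with open(name, "wb") as out:
            writer.write(out)
        created.append(name)
    return created
```

For a 10-page document and a chunk size of 4, chunk_ranges yields three ranges covering pages 1 to 4, 5 to 8, and 9 to 10.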

Encrypting a PDF File

Encrypting a PDF file means protecting it with a password. Each time the file is opened, the PDF viewer prompts for the password before showing the content.

We can use the following code for the same:

from PyPDF2 import PdfReader, PdfWriter

# password and outputpdf are assumed to be defined by the caller
pdf = PdfReader(pdf_path)
pdfwrite = PdfWriter()
for page in pdf.pages:
    pdfwrite.add_page(page)
# The first argument is the user password; 128-bit encryption is requested
pdfwrite.encrypt(password, use_128bit=True)
with open(outputpdf, 'wb') as fh:
    pdfwrite.write(fh)

Adding a Watermark to the PDF File

A watermark is an identifying image or pattern that appears on each page. It can be a company logo or any other important information that should appear on every page. To add a watermark to each page of the PDF, copy the following code and run it.

from PyPDF2 import PdfReader, PdfWriter

originalfile = r"C:\Users\Dell\Desktop\Testing Tesseract\example.pdf"
watermark = r"C:\Users\Dell\Desktop\Testing Tesseract\watermark.pdf"
watermarkedfile = r"C:\Users\Dell\Desktop\Testing Tesseract\watermarkedfile.pdf"
watermark = PdfReader(watermark)
# the first page of the watermark PDF is stamped onto every page
watermarkpage = watermark.pages[0]
pdf = PdfReader(originalfile)
pdfwrite = PdfWriter()
for pdfpage in pdf.pages:
    # overlay the watermark page on the current page
    pdfpage.merge_page(watermarkpage)
    pdfwrite.add_page(pdfpage)
with open(watermarkedfile, 'wb') as fh:
    pdfwrite.write(fh)

The above code reads two files: the input file and the watermark. It then merges the watermark onto each page of the input file and saves the result as a new file in the same location.

Conclusion

PyPDF2 stands out as an accessible, open-source solution for PDF manipulation in Python. Its source code is hosted on GitHub, and its online documentation is organized around short examples, so even those pressed for time can get through setup and a first script quickly. For those seeking further assistance or looking to contribute, the PyPDF2 community on GitHub welcomes inquiries and contributions.

The library is designed with automation and integration in mind, making it a common choice for developers looking to incorporate PDF manipulation into their workflows or applications. Since PyPDF2 is available on PyPI, installing it is straightforward for any Python project.

With no dependencies other than Python, PyPDF2 is portable across operating systems, so developers can deploy it in diverse environments without compatibility issues. The BSD-style license under which PyPDF2 is released allows developers to include it in commercial software packages, subject to the license terms.

In essence, PyPDF2 is a useful tool for Python developers interested in automating PDF manipulation, combining ease of use with adaptability. Whether you’re merging reports, splitting documents, or integrating PDF functionality into larger systems, PyPDF2’s feature set and supportive community make it a recommended resource.

Frequently Asked Questions

Q1. Can I use Python 3 for PDF manipulation?

A. Yes, Python 3 supports various libraries for PDF manipulation, such as PyPDF2, PDFMiner, and pdflib. These libraries allow you to perform operations like extracting text, merging, splitting, and encrypting PDFs in a Python 3 environment.

Q2. What is PyPDF for Python?

A. PyPDF refers to libraries like PyPDF2 and PyPDF4, which are Python libraries that allow users to work with PDF files. They provide functionalities for extracting information, merging, splitting, encrypting, and decrypting PDF documents.

Q3. Can I append text to a PDF using Python?

A. Appending text directly to a PDF is complex due to the format’s nature. Instead, you can use Python to add text as annotations or by creating a new PDF with the text and then merging it with the original PDF using PyPDF2.

Q4. Is it possible to decrypt PDF files with Python?

A. Yes, PyPDF2 allows you to decrypt PDF files, provided you have the necessary permissions and the password.
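As a rough illustration of that answer, the helper below unlocks a reader object only when the file is actually encrypted. The function name unlock is hypothetical. It assumes a PyPDF2 reader that exposes the is_encrypted property and a decrypt method returning 0 when the password is rejected, as recent releases do.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if needed; return True if it was encrypted."""
    if not reader.is_encrypted:
        return False
    # decrypt returns 0 when the password matches neither user nor owner
    if reader.decrypt(password) == 0:
        raise ValueError("incorrect password")
    return True


# Usage (assumes PyPDF2 3.x; the file name and password are placeholders):
# from PyPDF2 import PdfReader
# reader = PdfReader("protected.pdf")
# unlock(reader, "secret")
# print(len(reader.pages))
```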

Q5. How do I manipulate Excel files in relation to PDFs in Python?

A. To work with Excel files and PDFs, you can use libraries like Pandas to manipulate Excel data and then use ReportLab or PyPDF2 to generate or manipulate PDFs based on that data.

Q6. What are some tips for working with PDFs on Linux using Python?

A. PyPDF2 is pure Python and needs no system packages on Linux. Other tools you may combine with it, such as OCR engines, can require system libraries; install those with your distribution’s package manager.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Siddharth

Computer science enthusiast
