Adventures in Machine Learning

Mastering PDF Manipulation with PyPDF2: Tips and Tricks

Working with PDF Files using PyPDF2

PDF files are widely used in today’s world, particularly for documents such as contracts, reports, and handbooks. PyPDF2 is a powerful Python library that allows you to create, manipulate, and extract valuable information from PDF files with ease.

It’s a great tool for automating and streamlining many tasks that involve working with PDFs. In this article, we will explore the main PyPDF2 features and show how to use each one in your own scripts.

PyPDF2 Features

Here are some of the primary features that you can use with PyPDF2:

  • PDF Metadata: Extract valuable metadata from PDF files, such as the number of pages, author, creator app, and creation dates.
  • Extracting Content: Extract content such as text or images from PDFs to use in other applications.
  • Merge PDF files: Combine multiple PDFs into one file to create organized documents.
  • Rotate PDF file pages: Rotate pages within a PDF file to make them easier to read or view.
  • Scaling PDF pages: Adjust the size of pages in a PDF file to increase or decrease their size.
  • Extracting images from PDF pages: Extract images from existing PDF files to use in other applications.

Installing PyPDF2 Module

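Before we dive into the different PyPDF2 features, let’s first install the library with pip. Note that the examples in this article use the PdfFileReader, PdfFileWriter, and PdfFileMerger class names from PyPDF2 1.x and 2.x. PyPDF2 3.0 removed those names in favor of PdfReader, PdfWriter, and PdfMerger, so if the code fails with a deprecation error, install an older release with pip install "PyPDF2<3.0" instead. You can install PyPDF2 by running the following command: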

pip install PyPDF2

Extracting PDF Metadata

PDF files contain valuable metadata that you can use to gain insight into them. Here are some examples of metadata that you can extract:

  • The PDF author: The author of the PDF document.
  • Creator app: The application that was used to create the PDF document.
  • Creation Dates: The date the PDF document was created.
  • Number of Pages: The total number of pages in the PDF document.

To extract the metadata from a PDF file, you need to open the file in binary mode, and then create an instance of the PdfFileReader class.

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    info = pdf_reader.getDocumentInfo()
    print(info)

The output is a dictionary-like object whose keys, such as /Author and /CreationDate, name the metadata fields stored in the PDF.

Extracting Text of PDF Pages

The PdfFileReader class also lets you extract page content from a PDF file. Here is how you can extract text from the first page of a PDF file:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    first_page = pdf_reader.getPage(0)
    text = first_page.extractText()
    print(text)

Rotate PDF File Pages

You can also use PyPDF2 to rotate pages within a PDF file if they need to be displayed in a different orientation. Here is how you can rotate a page 90 degrees clockwise:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    pdf_writer = PyPDF2.PdfFileWriter()
    page.rotateClockwise(90)
    pdf_writer.addPage(page)
    with open('rotated_example.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

Merge PDF Files

You can merge PDF files together into one larger document using PyPDF2. Here is how you can merge two PDF files together:

import PyPDF2
pdf_merger = PyPDF2.PdfFileMerger()
with open('file1.pdf', 'rb') as file1, open('file2.pdf', 'rb') as file2:
    pdf_merger.append(file1)
    pdf_merger.append(file2)
with open('merged_files.pdf', 'wb') as output_file:
    pdf_merger.write(output_file)

Split PDF Files into Single-Page Files

You may also want to split a PDF file into multiple single-page PDFs to ease accessibility and readability. Here’s how you can achieve that using PyPDF2:

import PyPDF2
with open('example.pdf', 'rb') as input_file:
    pdf_reader = PyPDF2.PdfFileReader(input_file)
    num_pages = pdf_reader.numPages
    for page in range(num_pages):
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(page))
        with open(f'page_{page + 1}.pdf', 'wb') as output_file:
            pdf_writer.write(output_file)

Extracting Images from PDF Files

PyPDF2 can also be used to extract images from PDF files. Here is a simple script that reads the first page of a PDF file and saves each image it finds on that page:

import PyPDF2
from PIL import Image
with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            mode = ''
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # XObject names start with '/', so drop it to avoid writing to the filesystem root
            image.save(f'{obj[1:]}.png')

PyPDF2 Examples

Now that we have gone through the different PyPDF2 features, let us explore some real-life examples of the various functionalities.

Extracting PDF Metadata

Suppose you have a PDF document, and you want to know more about it, such as the number of pages, author, and the date it was created. Here’s how you can extract the metadata using PyPDF2:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    metadata = pdf_reader.documentInfo
    print(metadata['/Author'])
    print(metadata['/CreationDate'])
    print(pdf_reader.getNumPages())
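
The /CreationDate value printed above is a PDF date string of the form D:YYYYMMDDHHmmSS followed by an optional time zone offset, not a Python datetime. The following is a minimal sketch of how such a string could be converted using only the standard library. The function name parse_pdf_date is a hypothetical helper and not part of PyPDF2, and the pattern only covers the common layout described here.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for PDF date strings such as "D:20230115093000+02'00'".
# Everything after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if sign in ("Z", "z"):
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    # Missing month/day default to 1; missing time fields default to 0.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tzinfo,
    )


print(parse_pdf_date("D:20230115093000+02'00'"))
```

In the metadata example, you could pass metadata['/CreationDate'] to this helper and fall back to the raw string when it returns None.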

Extracting Text of PDF Pages

You could also extract the text of a specific page in a PDF, which is useful for collecting data or reading reports. Here is an example:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    text = page.extractText()
    print(text)
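
Text returned by extractText() can contain irregular spacing and stray blank lines, depending on how the PDF was produced. As a rough sketch, a small clean-up step like the one below can make the result easier to work with. The helper name tidy_extracted_text is an assumption for illustration and not part of PyPDF2.

```python
import re


def tidy_extracted_text(text):
    """Collapse runs of spaces/tabs and squeeze repeated blank lines."""
    # Normalize spacing inside each line and trim its ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    tidy = "\n".join(lines)
    # Squeeze three or more consecutive newlines down to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", tidy).strip()


sample = "Hello   world \n\n\n\nNext\tline "
print(tidy_extracted_text(sample))
```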

Rotate PDF File Pages

Suppose you have an auto-generated PDF from a system, and all the pages are in landscape format but are better viewed in portrait. Here’s how you can rotate every page in your file:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()
    for page in range(pdf_reader.getNumPages()):
        current_page = pdf_reader.getPage(page)
        pdf_writer.addPage(current_page.rotateClockwise(90))
    # Use a context manager so the output file is flushed and closed
    with open('rotated_example.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)

Merge PDF Files

You might have multiple files related to a project and want to combine them into a single document before sharing. Here’s how you can do that:

import PyPDF2
import contextlib
pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']
merged_pdf_file = 'merged_files.pdf'
with contextlib.ExitStack() as stack:
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files]
    pdf_merger = PyPDF2.PdfFileMerger()
    for file in files:
        pdf_reader = PyPDF2.PdfFileReader(file)
        pdf_merger.append(pdf_reader)
    # Write while the input files are still open, and close the output when done
    with open(merged_pdf_file, 'wb') as output_file:
        pdf_merger.write(output_file)
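
The merged document follows the order of the pdf_files list, so the order of that list matters. If the list is built from a directory listing, a plain alphabetical sort places file10.pdf before file2.pdf. The sketch below shows one possible way to sort names so that numbers compare numerically. The function natural_key is a hypothetical helper written for this illustration.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs numerically, so 'file2' precedes 'file10'."""
    # re.split with a capture group alternates text and digit chunks.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


pdf_files = sorted(["file10.pdf", "file2.pdf", "file1.pdf"], key=natural_key)
print(pdf_files)  # ['file1.pdf', 'file2.pdf', 'file10.pdf']
```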

Split PDF Files into Single-Page Files

You might have a PDF file with multiple pages, and you want each page as a separate file. Here’s how you can do that using PyPDF2:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for page in range(pdf_reader.getNumPages()):
        # A fresh writer per page produces one single-page PDF per iteration
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(page))
        output_filename = f'extracted_page_{page + 1}.pdf'
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
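
For documents with ten or more pages, names such as extracted_page_10.pdf sort before extracted_page_2.pdf in most file browsers. One option, sketched here under the assumption that you know the total page count up front, is to zero-pad the page number. The helper page_filename is hypothetical and only builds the name; it does not write any file.

```python
def page_filename(index, total, prefix="extracted_page"):
    """Build a zero-padded name for the page at zero-based `index` out of `total` pages."""
    # Pad to the number of digits in the largest page number.
    width = len(str(total))
    return f"{prefix}_{index + 1:0{width}d}.pdf"


print(page_filename(0, 120))
print(page_filename(11, 12))
```

In the split loop, the value of pdf_reader.getNumPages() would serve as total and the loop variable as index.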

Extracting Images from PDF Files

You might want to extract an image from a PDF document to use it somewhere else. Here is an example:

import PyPDF2
from PIL import Image
with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            mode = ''
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # XObject names start with '/', so drop it to avoid writing to the filesystem root
            image.save(f'{obj[1:]}.png')

Conclusion

PyPDF2 is a powerful Python library that allows you to process PDF files programmatically through a wide range of features that it provides – from extracting metadata and content, splitting or merging files, or even extracting images from your files. You can now work with PDF documents within your workflows and automate repetitive tasks without leaving the comfort of your Python environment.

Hopefully, this article provided a good introduction to PyPDF2 through its features and some examples of how to apply them in various situations.

In the following section, we take a more in-depth look at the key pieces used above: the PyPDF2 classes PdfFileReader, PdfFileWriter, and PdfFileMerger, the Pillow module, binary mode, and the getPage(), extractText(), rotateClockwise(), and getNumPages() methods, along with merging and splitting.

PyPDF2 Classes

PyPDF2 has three main classes that you use to do most of the work – PdfFileReader, PdfFileWriter, and PdfFileMerger. Here’s what they do:

  • PdfFileReader: Read the pages and metadata in a PDF file.
  • PdfFileWriter: Create new PDF files, including modified copies of existing ones.
  • PdfFileMerger: Combine two or more PDF files into a single document.

PdfFileReader

The PdfFileReader class is used to read a PDF file. You can use it to read the number of pages in a PDF, get the contents of a page, and extract metadata.

Here’s how you can use it:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # Get number of pages
    num_pages = pdf_reader.getNumPages()
    print(f'Number of pages: {num_pages}\n')
    # Get content from first page
    first_page = pdf_reader.getPage(0)
    print(first_page.extractText())
    # Get metadata from a PDF
    metadata = pdf_reader.getDocumentInfo()
    print(metadata)

In the sample code above, we opened the example.pdf file in binary read mode ('rb'), which lets PyPDF2 read the raw bytes of the file without modifying it.

Next, we created an instance of the PdfFileReader class, which allowed us to extract the number of pages, the first page of the PDF, and metadata.

PdfFileWriter

If you want to edit an existing PDF or create a new one from scratch, you can use PdfFileWriter. Here’s how it works:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()
    # Append a page to the end of the PDF
    pdf_writer.addPage(pdf_reader.getPage(0))
    # Rotate the page 90 degrees clockwise
    page = pdf_reader.getPage(1)
    page.rotateClockwise(90)
    # Add the page to PDF Writer
    pdf_writer.addPage(page)
    # Write the resulting PDF to a file
    with open('output.pdf', 'wb') as output_pdf:
        pdf_writer.write(output_pdf)

In the example above, we opened the sample PDF file in binary read mode and used the PdfFileReader class to read its first two pages, so the input file needs at least two pages. We used the PdfFileWriter class to build a new PDF, adding the first page unchanged and the second page rotated 90 degrees clockwise.

Finally, we wrote the modified PDF to a new output file.

PdfFileMerger

The PdfFileMerger class is used to merge two or more PDF files into a single PDF. Here’s how it works:

import PyPDF2
pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']
pdf_merger = PyPDF2.PdfFileMerger()
for file in pdf_files:
    with open(file, 'rb') as pdf_file:
        pdf_merger.append(fileobj=pdf_file)
with open('merged_pdf.pdf', 'wb') as output_file:
    pdf_merger.write(output_file)

In the example, we created a PdfFileMerger object and iterated through a list of files to add them to the merger.

Once all the PDF files were added, we opened the output file and wrote the combined PDF to it.

Pillow Module

Pillow is a Python library that adds support for opening, manipulating, and saving many image file formats such as JPEG, PNG, BMP, and TIFF. It is useful when working with PDF files because you can extract images from the document and use them in your application.

import PyPDF2
from PIL import Image
with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            mode = ''
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # XObject names start with '/', so drop it to avoid writing to the filesystem root
            image.save(f'{obj[1:]}.png')

In the example above, the page variable holds the first page of the PDF document.

We then loop through the entries under ‘/XObject’ to find the ones whose subtype is an image. Each image is converted to a PIL Image and finally saved to disk.
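
The example treats every image that is not /DeviceRGB as palette mode 'P', which will not suit grayscale or CMYK images, and it does not look at how the image stream is encoded. The sketch below shows one way the decision could be made more explicit. The lookup tables and function names are hypothetical and written for this illustration; they are not part of PyPDF2 or Pillow, and they only cover simple cases where the color space and filter are plain names.

```python
# Hypothetical lookup tables; the names are illustrative, not a library API.
COLOR_SPACE_TO_MODE = {
    "/DeviceRGB": "RGB",
    "/DeviceGray": "L",
    "/DeviceCMYK": "CMYK",
}
FILTER_TO_EXTENSION = {
    "/DCTDecode": ".jpg",  # stream is JPEG data
    "/JPXDecode": ".jp2",  # stream is JPEG 2000 data
}


def pil_mode(color_space):
    """Return a Pillow mode for a simple device color space, or None if unsupported."""
    if isinstance(color_space, str):
        return COLOR_SPACE_TO_MODE.get(color_space)
    # Array-valued color spaces (for example ICC-based ones) need extra handling.
    return None


def encoded_extension(filter_name):
    """Return a file extension when the stream is already an encoded image, else None."""
    if isinstance(filter_name, str):
        return FILTER_TO_EXTENSION.get(filter_name)
    return None


print(pil_mode("/DeviceGray"), encoded_extension("/DCTDecode"))
```

A caller might skip any image for which both helpers return None rather than guess a mode, since Image.frombytes raises an error when the mode and data size do not agree.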

Binary Mode

When working with PDF files, you typically need to open them in binary mode. This is because PDF files contain large amounts of binary data and are not plain text files.

Here’s how you can work with PDF files in binary mode:

import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
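
Because the file is opened in binary mode, reads return bytes rather than text. A PDF file begins with the marker %PDF-, so a quick sanity check before handing a file to PyPDF2 could look like the sketch below. The function looks_like_pdf is a hypothetical helper, and io.BytesIO stands in for a real file opened with 'rb'.

```python
import io


def looks_like_pdf(stream):
    """Return True if the binary stream starts with the PDF header marker."""
    position = stream.tell()
    header = stream.read(5)
    # Restore the position so a reader can start from the same place.
    stream.seek(position)
    return header == b"%PDF-"


fake_pdf = io.BytesIO(b"%PDF-1.7\n% rest of the file")
print(looks_like_pdf(fake_pdf))
print(looks_like_pdf(io.BytesIO(b"plain text")))
```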
