Adventures in Machine Learning

Mastering PDF Manipulation with PyPDF2: Tips and Tricks

Working with PDF Files using PyPDF2

PDF files are widely used in today’s world, particularly for documents such as contracts, reports, and handbooks. PyPDF2 is a powerful Python library that allows you to create, manipulate, and extract valuable information from PDF files with ease.

It’s a great tool for automating and streamlining many tasks that involve working with PDFs. In this article, we will explore different PyPDF2 features and learn how you can use them to leverage the power of PDFs.

PyPDF2 Features

Here are some of the primary features that you can use with PyPDF2:

– PDF Metadata: Extract valuable metadata from PDF files, such as the number of pages, author, creator app, and creation dates.
– Extracting Content: Extract content such as text or images from PDFs to use in other applications.
– Merge PDF files: Combine multiple PDFs into one file to create organized documents.
– Rotate PDF file pages: Rotate pages within a PDF file to make them easier to read or view.
– Scaling PDF pages: Adjust the size of pages in a PDF file to increase or decrease their size.
– Extracting images from PDF pages: Extract images from existing PDF files to use in other applications.

Installing PyPDF2 Module

Before we dive into the different PyPDF2 features, let’s install the library with pip. The examples in this article use the library’s original camelCase API (PdfFileReader, getPage(), and so on), which was removed in PyPDF2 3.0, so pin an earlier release:

```
pip install "PyPDF2<3.0"
```

Extracting PDF Metadata

PDF files contain valuable metadata that you can use to gain insight into them. Here are some examples of metadata that you can extract:

– The PDF author: The author of the PDF document.
– Creator app: The application that was used to create the PDF document.
– Creation date: The date the PDF document was created.
– Number of pages: The total number of pages in the PDF document.

To extract the metadata from a PDF file, you need to open the file in binary mode and then create an instance of the PdfFileReader class.

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    info = pdf_reader.getDocumentInfo()
    print(info)
```

The output will display the PDF metadata.
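The creation date in that metadata usually comes back as a raw PDF date string such as `D:20230115093000+02'00'` rather than a Python date. The sketch below, which uses only the standard library, shows one way to convert it. The helper name `parse_pdf_date` is an assumption for illustration and is not part of PyPDF2, and it only handles the common `D:YYYYMMDDHHmmSS` layout with an optional UTC offset.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper (not part of PyPDF2): parse a PDF date string.
# Every field after the year is optional, as is the UTC offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)

def parse_pdf_date(raw):
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    tz = None  # stays naive when the string carries no offset
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    return datetime(int(year), int(month or 1), int(day or 1),
                    int(hour or 0), int(minute or 0), int(second or 0),
                    tzinfo=tz)

print(parse_pdf_date("D:20230115093000+02'00'"))  # 2023-01-15 09:30:00+02:00
```

Dates that do not follow this layout raise a ValueError, so real-world code would probably want to fall back to keeping the raw string.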

Extracting Text of PDF Pages

The PdfFileReader class also lets you extract page content from a PDF file. Here is how you can extract text from the first page of a PDF file:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    first_page = pdf_reader.getPage(0)
    text = first_page.extractText()
    print(text)
```
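Extracted text often has irregular spacing and stray blank lines, depending on how the PDF was produced. As a rough sketch, the standard-library helper below tidies such a string before further processing. `normalize_text` is a hypothetical name, not a PyPDF2 function, and the sample string merely stands in for real extraction output.

```python
import re

def normalize_text(raw):
    """Collapse runs of spaces/tabs and drop blank lines from extracted text."""
    lines = []
    for line in raw.splitlines():
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:
            lines.append(cleaned)
    return "\n".join(lines)

# Stand-in for the string returned by extractText()
sample = "Annual   Report \n\n\n  Revenue\t\t2023  \n"
print(normalize_text(sample))
```

This keeps one line per non-empty source line, which is usually enough for searching or simple parsing.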

Rotate PDF File Pages

You can also use PyPDF2 to rotate pages within a PDF file if they need to be displayed in a different orientation. Here is how you can rotate a page 90 degrees clockwise:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    pdf_writer = PyPDF2.PdfFileWriter()
    page.rotateClockwise(90)
    pdf_writer.addPage(page)
    # Write the output while the source file is still open
    with open('rotated_example.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
```
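rotateClockwise() expects a multiple of 90 degrees, and a page may already carry a rotation of its own. The sketch below works through the arithmetic in plain Python: given a page’s current rotation and a requested turn, it returns the resulting angle normalized to 0, 90, 180, or 270. The function name `combined_rotation` is an illustrative assumption and not part of PyPDF2.

```python
def combined_rotation(current, turn):
    """Return the resulting rotation in degrees, normalized to 0-270."""
    if turn % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return (current + turn) % 360

print(combined_rotation(0, 90))     # 90
print(combined_rotation(270, 180))  # 90
print(combined_rotation(90, -90))   # 0
```

A negative turn corresponds to rotating counter-clockwise, which is why the last call brings the page back to 0.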

Merge PDF Files

You can merge PDF files together into one larger document using PyPDF2. Here is how you can merge two PDF files together:

```python
import PyPDF2

pdf_merger = PyPDF2.PdfFileMerger()

with open('file1.pdf', 'rb') as file1, open('file2.pdf', 'rb') as file2:
    pdf_merger.append(file1)
    pdf_merger.append(file2)

with open('merged_files.pdf', 'wb') as output_file:
    pdf_merger.write(output_file)
```
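The merged document follows the order in which files are appended, so it helps to sort the input names deliberately. A plain alphabetical sort puts file10.pdf before file2.pdf. The sketch below shows a natural sort using only the standard library; the helper `natural_key` and the file names are assumptions for illustration.

```python
import re

def natural_key(name):
    # Split into digit and non-digit chunks; digit chunks compare as integers
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

pdf_files = ["file10.pdf", "file2.pdf", "file1.pdf"]
ordered = sorted(pdf_files, key=natural_key)
print(ordered)  # ['file1.pdf', 'file2.pdf', 'file10.pdf']
```

The sorted list can then be passed to the merging loop in place of a hand-written sequence of file names.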

Split PDF Files into Single-Page Files

You may also want to split a PDF file into multiple single-page PDFs so that individual pages are easier to access and share. Here’s how you can achieve that using PyPDF2:

```python
import PyPDF2

with open('example.pdf', 'rb') as input_file:
    pdf_reader = PyPDF2.PdfFileReader(input_file)
    num_pages = pdf_reader.numPages
    for page in range(num_pages):
        # A fresh writer per page gives one output file per page
        pdf_writer = PyPDF2.PdfFileWriter()
        pdf_writer.addPage(pdf_reader.getPage(page))
        with open(f'page_{page + 1}.pdf', 'wb') as output_file:
            pdf_writer.write(output_file)
```
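When a document has ten or more pages, names such as page_10.pdf sort before page_2.pdf in most file listings. One possible refinement, sketched below with the standard library only, is to zero-pad the page number to a width derived from the page count. The helper `page_filenames` and the `page` stem are illustrative assumptions.

```python
def page_filenames(num_pages, stem="page"):
    """Build one zero-padded output name per page (1-based numbering)."""
    width = len(str(num_pages))
    return [f"{stem}_{number:0{width}d}.pdf"
            for number in range(1, num_pages + 1)]

names = page_filenames(12)
print(names[0], names[-1])  # page_01.pdf page_12.pdf
```

Inside the splitting loop, `names[page]` could replace the f-string used to build each output path.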

Extracting Images from PDF Files

PyPDF2 can also be used to extract images from PDF files, with the help of the Pillow library. Here is a simple example that finds the images on the first page of a PDF file and saves each one as a PNG:

```python
import PyPDF2
from PIL import Image

with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()

    # Iterate through all objects and find images
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # obj is a name such as '/Im0'; drop the leading slash
            image.save(f'{obj[1:]}.png')
```

PyPDF2 Examples

Now that we have gone through the different PyPDF2 features, let us explore some real-life examples of the various functionalities.

Extracting PDF Metadata

Suppose you have a PDF document, and you want to know more about it, such as the number of pages, author, and the date it was created. Here’s how you can extract the metadata using PyPDF2:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    metadata = pdf_reader.documentInfo
    print(metadata['/Author'])
    print(metadata['/CreationDate'])
    print(pdf_reader.getNumPages())
```

Extracting Text of PDF Pages

You could also extract the text of a specific page in a PDF, which is useful for collecting data or reading reports. Here is an example:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    text = page.extractText()
    print(text)
```

Rotate PDF File Pages

Suppose you have an auto-generated PDF from a system, and all the pages are in landscape format, while they are better viewed in portrait. Here’s how you can rotate every page in your file:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()
    for page in range(pdf_reader.getNumPages()):
        current_page = pdf_reader.getPage(page)
        pdf_writer.addPage(current_page.rotateClockwise(90))
    # Use a context manager so the output file is flushed and closed
    with open('rotated_example.pdf', 'wb') as output_file:
        pdf_writer.write(output_file)
```

Merge PDF Files

You might have multiple files related to a project and want to combine them into a single document before sharing. Here’s how you can do that:

```python
import PyPDF2
import contextlib

pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']
merged_pdf_file = 'merged_files.pdf'

with contextlib.ExitStack() as stack:
    # ExitStack closes every input file when the block ends
    files = [stack.enter_context(open(pdf, 'rb')) for pdf in pdf_files]
    pdf_merger = PyPDF2.PdfFileMerger()
    for file in files:
        pdf_reader = PyPDF2.PdfFileReader(file)
        pdf_merger.append(pdf_reader)
    with open(merged_pdf_file, 'wb') as output_file:
        pdf_merger.write(output_file)
```

Split PDF Files into Single-Page Files

You might have a PDF file with multiple pages, and you want each page as a separate file. Here’s how you can do that using PyPDF2:

```python
import PyPDF2

# Open the source with a context manager so it is closed afterwards
with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    for page in range(pdf_reader.getNumPages()):
        pdf_writer = PyPDF2.PdfFileWriter()
        current_page = pdf_reader.getPage(page)
        pdf_writer.addPage(current_page)
        output_filename = f'extracted_page_{page + 1}.pdf'
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
```

Extracting Images from PDF Files

You might want to extract an image from a PDF document to use it somewhere else. Here is an example:

```python
import PyPDF2
from PIL import Image

with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # obj is a name such as '/Im0'; drop the leading slash
            image.save(f'{obj[1:]}.png')
```

Conclusion

PyPDF2 is a powerful Python library that lets you process PDF files programmatically, from extracting metadata and content to splitting or merging files and even extracting images. You can now work with PDF documents within your workflows and automate repetitive tasks without leaving the comfort of your Python environment.

Hopefully, this article has provided a good introduction to PyPDF2, its features, and some examples of how to apply it in various situations. So far, we have explored the primary features of PyPDF2, a powerful Python library for creating, manipulating, and extracting valuable information from PDF files.

In the following section, we will take a more in-depth look at some of the key functionalities, namely the PyPDF2 classes PdfFileReader, PdfFileWriter, and PdfFileMerger, and explore various use cases of the Pillow module, binary mode, getPage(), extractText(), rotateClockwise(), getNumPages(), and merging and splitting.

PyPDF2 Classes

PyPDF2 has three main classes that do most of the work: PdfFileReader, PdfFileWriter, and PdfFileMerger. Here’s what they do:

– PdfFileReader: Reads the pages and metadata in a PDF file.
– PdfFileWriter: Creates new PDF files or edits existing ones.
– PdfFileMerger: Combines one or more PDF files into a single document.

PdfFileReader

The PdfFileReader class is used to read a PDF file. You can use it to read the number of pages in a PDF, get the contents of a page, and extract metadata.

Here’s how you can use it:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)

    # Get number of pages
    num_pages = pdf_reader.getNumPages()
    print(f'Number of pages: {num_pages}\n')

    # Get content from first page
    first_page = pdf_reader.getPage(0)
    print(first_page.extractText())

    # Get metadata from a PDF
    metadata = pdf_reader.getDocumentInfo()
    print(metadata)
```

In the sample code above, we opened the example.pdf file in binary read mode ('rb'), which gives PyPDF2 the raw bytes of the file.

Next, we created an instance of the PdfFileReader class, which allowed us to read the number of pages, the text of the first page, and the metadata.

PdfFileWriter

If you want to edit an existing PDF or create a new one from scratch, you can use PdfFileWriter. Here’s how it works:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    pdf_writer = PyPDF2.PdfFileWriter()

    # Append the first page to the new PDF
    pdf_writer.addPage(pdf_reader.getPage(0))

    # Rotate the second page 90 degrees clockwise
    page = pdf_reader.getPage(1)
    page.rotateClockwise(90)

    # Add the rotated page to the writer
    pdf_writer.addPage(page)

    # Write the resulting PDF while the source file is still open
    with open('output.pdf', 'wb') as output_pdf:
        pdf_writer.write(output_pdf)
```

In the example above, we opened the sample PDF file in binary read mode and used the PdfFileReader class to read its first two pages. We used the PdfFileWriter class to build a new PDF, adding the first page unchanged and the second page rotated by 90 degrees.

Finally, we wrote the result to a new output file.

PdfFileMerger

The PdfFileMerger class is used to merge two or more PDF files into a single PDF. Here’s how it works:

```python
import PyPDF2

pdf_files = ['file1.pdf', 'file2.pdf', 'file3.pdf']

pdf_merger = PyPDF2.PdfFileMerger()

for file in pdf_files:
    with open(file, 'rb') as pdf_file:
        pdf_merger.append(fileobj=pdf_file)

with open('merged_pdf.pdf', 'wb') as output_file:
    pdf_merger.write(output_file)
```

In the example, we created a PdfFileMerger object and iterated through the list of files to add them to the merger.

Once we had added all the PDF files, we specified the output file name and wrote the combined PDF to a new file.

Pillow Module

Pillow is a Python library that adds support for opening, manipulating, and saving many image file formats such as JPEG, PNG, BMP, and TIFF. It is useful when working with PDF files because you can extract images from the document and use them in your application.

```python
import PyPDF2
from PIL import Image

with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    xObject = page['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = 'RGB'
            else:
                mode = 'P'
            image = Image.frombytes(mode, size, data)
            # obj is a name such as '/Im0'; drop the leading slash
            image.save(f'{obj[1:]}.png')
```

In the example above, the page object holds the first page of the PDF document.

We then loop through the entries of its '/XObject' dictionary to find the images. Each image is converted to a PIL Image and finally saved to disk.
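Image.frombytes() expects pixel data whose length matches what the mode and size imply, so it can be useful to check this before building the image. The sketch below is a standard-library illustration of that check for 8-bit, uncompressed pixel data. The mapping `MODE_FOR_COLORSPACE` and the helper `expected_length` are assumptions for this example, and images stored with other encodings (JPEG data, for instance) would need different handling.

```python
# Assumed mapping from PDF color spaces to Pillow modes and bytes per pixel
MODE_FOR_COLORSPACE = {
    "/DeviceRGB": ("RGB", 3),
    "/DeviceGray": ("L", 1),
    "/DeviceCMYK": ("CMYK", 4),
}

def expected_length(color_space, width, height):
    """Return (mode, byte_count) for 8-bit raw pixel data, or None if unknown."""
    entry = MODE_FOR_COLORSPACE.get(color_space)
    if entry is None:
        return None
    mode, bytes_per_pixel = entry
    return mode, width * height * bytes_per_pixel

data = bytes(4 * 2 * 3)  # stand-in for the bytes returned by getData()
mode, length = expected_length("/DeviceRGB", 4, 2)
print(mode, length, len(data) == length)  # RGB 24 True
```

If the lengths differ, or the color space is not in the mapping, skipping the image is probably safer than passing the data to Pillow.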

Binary Mode

When working with PDF files, you typically need to open them in binary mode. This is because PDF files contain large amounts of binary data and are not plain text files.
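To make the point concrete, the sketch below writes a few bytes to a temporary file and then reads them back in binary mode to check for the `%PDF-` signature that PDF files start with. It uses only the standard library. The helper `looks_like_pdf` and the sample bytes are assumptions for illustration, and the check says nothing about whether the rest of the file is valid.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature."""
    with open(path, "rb") as handle:  # 'rb' yields bytes, not str
        return handle.read(5) == b"%PDF-"

# Create a small stand-in file: a PDF header followed by non-text bytes
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    sample_path = tmp.name

result = looks_like_pdf(sample_path)
print(result)  # True
os.remove(sample_path)
```

Opening the same file in text mode could fail or silently alter the bytes, depending on the platform’s default encoding and newline handling.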

Here’s how you can work with PDF files in binary mode:

```python
import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
```
