Exploring PDF Files – Reading and Extracting Text
PDF (Portable Document Format) files are widely used in many fields, from education to business. This file format has gained popularity because it is an efficient way to distribute information, as it preserves the formatting of the document and it can be viewed on various devices without compromising the layout.
PDF files are also a good option for storing and sharing digital documents because they are harder to alter by accident than editable word-processor files. In this article, we will explore how to read and extract text from PDF files using Python.
Opening PDF files with PyPDF2
Python is a versatile programming language, and several libraries can read and manipulate PDF files. One of them is PyPDF2.
PyPDF2 provides a wide range of functionality for working with PDF files. To use it, we first install it by running the following command in a terminal:
pip install PyPDF2
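After installing the library, we can start reading PDF files. Here's how we can open a PDF file using PyPDF2:
from PyPDF2 import PdfReader

pdf_file = open('my_file.pdf', 'rb')
pdf_reader = PdfReader(pdf_file)
In the code snippet above, we first open the PDF file using the built-in open() function. The second argument, 'rb', tells Python to open the file in binary mode, which is required because PDF data is not plain text.
Then, we create a PdfReader object by passing the opened file to PdfReader(). The file must stay open for as long as we read pages from pdf_reader. Note that PyPDF2 3.0 removed the older camelCase names such as PdfFileReader, getPage(), and numPages in favor of PdfReader and the pages attribute.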
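Because binary mode gives us the raw bytes, it is also a cheap way to check a file before handing it to the library. The following is a minimal sketch using only the standard library. The helper name looks_like_pdf is hypothetical, and it assumes a strictly formed file whose first bytes are the %PDF- header marker; the sample file contents are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file begins with the PDF header marker b'%PDF-'."""
    with open(path, 'rb') as f:  # binary mode: we compare raw bytes
        return f.read(5) == b'%PDF-'


# Demonstration with two throwaway files (contents are made up for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = Path(tmp) / 'sample.pdf'
    fake_pdf.write_bytes(b'%PDF-1.7\n')
    text_file = Path(tmp) / 'notes.txt'
    text_file.write_bytes(b'just some text')
    pdf_check = looks_like_pdf(fake_pdf)
    text_check = looks_like_pdf(text_file)

print(pdf_check, text_check)
```

A check like this only looks at the header, so it can reject an obviously wrong file early but cannot confirm that the rest of the document is valid.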
Extracting text from a page with PageObject.extract_text()
Once we have opened a PDF file using PyPDF2, we can start reading the contents of the file.
To extract text from a specific page in a PDF file, we can use PageObject.extract_text(). Here’s an example of how to extract text from the first page of a PDF file:
page_obj = pdf_reader.pages[0]
page_text = page_obj.extract_text()
print(page_text)
In the code snippet above, we get the first page object by indexing pdf_reader.pages with 0 (pages are numbered from zero). Then, we extract the text from the page object using the extract_text() method.
Extracting text from all pages
If we want to extract text from all the pages in a PDF file, we can loop through all the page objects and extract their text. Here’s how we can do it:
for page_obj in pdf_reader.pages:
    page_text = page_obj.extract_text()
    print(page_text)
In the code snippet above, pdf_reader.pages behaves like a list of page objects, so len(pdf_reader.pages) gives the total number of pages in the PDF file.
We loop over the pages directly and extract the text of each one using the extract_text() method.
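When looping over many pages, it can help to keep the page number next to each piece of text and to notice pages that yield nothing, which typically happens when a page is a scanned image. The sketch below is illustrative only: collect_page_texts, empty_page_numbers, and StubPage are hypothetical names, and StubPage stands in for a real page object so that the example runs without a PDF file. It assumes only that each page exposes an extract_text() method.

```python
def collect_page_texts(pages):
    """Return (page_number, text) pairs, numbering pages from 1.

    `pages` is any iterable of objects with an extract_text() method.
    """
    results = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ''  # treat None like an empty page
        results.append((number, text.strip()))
    return results


def empty_page_numbers(page_texts):
    """List the page numbers whose extracted text is empty."""
    return [number for number, text in page_texts if not text]


class StubPage:
    """Stand-in for a real page object, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


stub_pages = [StubPage('Introduction'), StubPage(''), StubPage('  Summary\n')]
page_texts = collect_page_texts(stub_pages)
print(page_texts)
print(empty_page_numbers(page_texts))
```

With a real document, the same functions could be given the reader's page collection instead of the stub list.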
Saving extracted text to a .txt file
We can save the extracted text to a text file using the Path class from Python's standard-library pathlib module.
Here's how we can do it:
from pathlib import Path

output_file = Path('output.txt')
with output_file.open(mode='w', encoding='utf-8') as f:
    for page_obj in pdf_reader.pages:
        page_text = page_obj.extract_text()
        f.write(page_text)
In the code snippet above, we create a Path object with the name of the output file. Then, we open the file in write mode with UTF-8 encoding and loop through all the pages to extract their text.
Finally, we write the extracted text of each page to the file using the write() method.
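Writing the pages back to back makes it hard to tell where one page ends and the next begins. One possible refinement, sketched here with the standard library only, is to write a marker line before each page. The function name write_page_texts and the '--- Page N ---' marker format are assumptions chosen for illustration, and the page texts are plain strings standing in for real extraction results.

```python
import tempfile
from pathlib import Path


def write_page_texts(page_texts, output_path):
    """Write each page's text under a '--- Page N ---' marker line."""
    output_path = Path(output_path)
    with output_path.open(mode='w', encoding='utf-8') as f:
        for number, text in enumerate(page_texts, start=1):
            f.write(f'--- Page {number} ---\n')
            f.write(text.rstrip('\n') + '\n')  # exactly one newline per page
    return output_path


# Demonstration with made-up page texts written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    out = write_page_texts(
        ['First page text', 'Second page text\n'],
        Path(tmp) / 'output.txt',
    )
    written = out.read_text(encoding='utf-8')

print(written)
```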
Retrieving Pages and Writing to PDF Files
In addition to reading PDF files, we may also need to create or modify them. PyPDF2 provides functionalities for creating, modifying, and writing PDF files.
Let’s explore some of these functionalities.
Creating new PDF files with PdfWriter
To create a new PDF file, we can use the PdfWriter class. Here's an example of how we can create a new PDF file with one blank page:
from PyPDF2 import PdfWriter

pdf_writer = PdfWriter()
pdf_writer.add_blank_page(width=595, height=842)  # approximately A4, in points
with open('new_file.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we create a PdfWriter object and add a blank page to it using the add_blank_page() method.
The width and height arguments specify the size of the page in points. One point is equivalent to 1/72 inch.
Finally, we write the contents of the PdfWriter object to a new file using the write() method.
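Since page sizes are given in points, a small conversion helper makes the numbers less magical. The sketch below uses only arithmetic from the definition of a point as 1/72 inch; the function names mm_to_points and inches_to_points are hypothetical. It shows where 595 by 842 comes from for an A4 sheet (210 mm by 297 mm).

```python
POINTS_PER_INCH = 72
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF points (1 point = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def inches_to_points(inches):
    """Convert inches to PDF points."""
    return inches * POINTS_PER_INCH


# A4 is 210 mm x 297 mm; US Letter is 8.5 in x 11 in.
a4_width = mm_to_points(210)
a4_height = mm_to_points(297)
letter = (inches_to_points(8.5), inches_to_points(11))

print(round(a4_width), round(a4_height))
print(letter)
```

Rounding the A4 values to whole points gives the 595 and 842 used for the blank page.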
Adding existing PageObject instances to PdfWriter
To add existing pages to a PDF file, we can again use the PdfWriter class. Here's an example of how to add an existing page to a new PDF file:
from PyPDF2 import PdfReader, PdfWriter

pdf_reader = PdfReader('my_file.pdf')
pdf_writer = PdfWriter()
page_obj = pdf_reader.pages[0]
pdf_writer.add_page(page_obj)
with open('new_file.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we open an existing PDF file by passing its path to PdfReader and get the first page from pdf_reader.pages.
We also create a PdfWriter object and add the page object to it using the add_page() method. Finally, we write the contents of the PdfWriter object to a new file.
Extracting a single page from a PDF file
If we want to extract a specific page from a PDF file and save it as a new file, we can use the PdfWriter class in the same way. Here's how we can do it:
from PyPDF2 import PdfReader, PdfWriter

pdf_reader = PdfReader('my_file.pdf')
pdf_writer = PdfWriter()
page_obj = pdf_reader.pages[0]  # change the index to pick a different page
pdf_writer.add_page(page_obj)
with open('new_file.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we open an existing PDF file and select the page we want by its zero-based index in pdf_reader.pages.
We then add that page object to a new PdfWriter object using the add_page() method and write the result to a new file, which contains only the selected page.
Extracting multiple pages from a PDF file
If we want to extract multiple pages from a PDF file and save them as a new file, we can use the PdfWriter class in combination with a for loop. Here's how we can do it:
from PyPDF2 import PdfReader, PdfWriter

pdf_reader = PdfReader('my_file.pdf')
pdf_writer = PdfWriter()
for page_num in range(0, 3):  # extract the first 3 pages
    page_obj = pdf_reader.pages[page_num]
    pdf_writer.add_page(page_obj)
with open('new_file.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we open an existing PDF file, create a PdfWriter object, and loop through the first three pages using a for loop.
Inside the loop, we add each page object to the PdfWriter object using the add_page() method. Finally, we write the contents of the PdfWriter object to a new file. The source file must have at least three pages, otherwise indexing a missing page raises an error.
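People usually think of page ranges as 1-based and inclusive ("pages 1 to 3"), while the loop needs 0-based indices that stay within the document. The following sketch shows one way to do that conversion with an explicit bounds check. The function name page_range_to_indices is hypothetical, and the page count is passed in as a plain integer so the example runs without a PDF file.

```python
def page_range_to_indices(first_page, last_page, num_pages):
    """Convert a 1-based inclusive page range to a list of 0-based indices.

    Raises ValueError if the range does not fit inside the document.
    """
    if not 1 <= first_page <= last_page <= num_pages:
        raise ValueError(
            f'pages {first_page}-{last_page} are outside a {num_pages}-page document'
        )
    return list(range(first_page - 1, last_page))


indices = page_range_to_indices(1, 3, 10)
print(indices)
```

Failing early with a clear message is a design choice; clamping the range to the available pages would be an equally reasonable alternative.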
Conclusion
In this article, we have explored how to read and extract text from PDF files using the PyPDF2 library in Python. We have also looked at various ways of creating, modifying, and writing PDF files.
By mastering these techniques, you can quickly automate your PDF-related tasks and save time. The possibilities are limitless, from generating invoices and contracts to analyzing research papers and reports.
The key is to experiment and find the right approach that suits your needs.
Concatenating and Merging PDF Files
PDF files are excellent for document preservation, but they can be cumbersome when a task involves several of them at once. With the PyPDF2 library in Python, concatenating and merging PDF files becomes a straightforward process.
Follow along to learn how to merge and concatenate PDF files using PyPDF2.
Merging PDF files with PdfMerger
To merge two or more PDF files into a single document, we use the PyPDF2 PdfMerger class. Here’s how we can merge two PDF files:
from PyPDF2 import PdfMerger

pdf_merger = PdfMerger()
pdf_merger.append('file1.pdf')
pdf_merger.append('file2.pdf')
with open('merged_file.pdf', 'wb') as f:
    pdf_merger.write(f)
pdf_merger.close()
In the code snippet above, we first import the PdfMerger class from PyPDF2. Then, we create a PdfMerger object and use the append() method to add the PDF files we want to merge, in the order they should appear in the result.
Finally, we write the merged PDF to a new file using the write() method and close the merger. Note that PdfMerger holds the source PDF files in memory while merging them.
Therefore, it may not be suitable for merging very large PDF files. In such cases, we can use PdfReader to extract only the pages we need and merge them using PdfWriter.
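Because the order of the append() calls decides the page order of the merged document, it is worth building the list of input files deliberately rather than relying on whatever order the file system returns. A minimal sketch with the standard library follows; the helper name list_pdfs and the sample file names are made up for illustration, and sorting case-insensitively by file name is just one possible ordering rule.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the .pdf files in `folder`, sorted case-insensitively by name."""
    candidates = (
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == '.pdf'
    )
    return sorted(candidates, key=lambda p: p.name.lower())


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b_part.pdf', 'a_part.pdf', 'notes.txt', 'C_PART.PDF']:
        (Path(tmp) / name).write_bytes(b'')
    names = [p.name for p in list_pdfs(tmp)]

print(names)
```

Each path in the resulting list could then be passed to append() in turn.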
Concatenating PDF files with .append()
The append() method shown above already concatenates whole files. When we need page-level control, we can instead read each source file with PdfReader and add its pages to a PdfWriter one at a time. Here's how we can concatenate two PDF files this way:
from PyPDF2 import PdfReader, PdfWriter

pdf_writer = PdfWriter()
for path in ['file1.pdf', 'file2.pdf']:
    pdf_reader = PdfReader(path)
    for page_obj in pdf_reader.pages:
        pdf_writer.add_page(page_obj)
with open('concatenated_file.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we first create a PdfWriter object to hold the concatenated pages. Then, we loop through the source files, opening each one with PdfReader by passing its path, so that no file is closed before its pages have been written.
Inside the inner loop, we add each page to the PdfWriter object. Finally, we write the result to a file using the write() method.
Rotating and Cropping Pages
PDF files can sometimes require rotation or cropping to achieve a better layout. In PyPDF2, we can rotate and crop each page as needed.
Let’s explore how it’s done.
Rotating pages with PageObject.rotate()
To rotate a page, we can use the PageObject.rotate() method.
Here’s how we can rotate a page by 90 degrees:
from PyPDF2 import PdfReader, PdfWriter

pdf_reader = PdfReader('file1.pdf')
pdf_writer = PdfWriter()
page_obj = pdf_reader.pages[0]
page_obj.rotate(90)  # rotate 90 degrees clockwise
pdf_writer.add_page(page_obj)
with open('rotated_pdf.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we first open the PDF file using PdfReader. Then, we get the first page from pdf_reader.pages and rotate it clockwise by 90 degrees using the rotate() method, which accepts multiples of 90.
Finally, we add the rotated page to the PdfWriter object and write it to a file.
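Page rotation works in quarter turns, so any requested angle has to be a multiple of 90, and a counter-clockwise turn can be expressed as its clockwise equivalent. The sketch below validates and normalizes an angle before it is used for rotation. The function name normalize_rotation is hypothetical, and the code is plain arithmetic that does not call the PDF library.

```python
def normalize_rotation(angle):
    """Return the equivalent clockwise rotation in the range 0 to 270.

    Negative angles mean counter-clockwise turns. Raises ValueError if
    the angle is not a multiple of 90 degrees.
    """
    if angle % 90 != 0:
        raise ValueError(f'rotation must be a multiple of 90, got {angle}')
    return angle % 360


quarter_turn_left = normalize_rotation(-90)   # same as 270 clockwise
more_than_full_turn = normalize_rotation(450)  # same as 90 clockwise
print(quarter_turn_left, more_than_full_turn)
```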
Cropping pages with RectangleObject
To crop a PDF page, we use the PdfReader class to extract the crop box of a page and manipulate it using RectangleObject. Here’s how we can crop a page from the left:
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import RectangleObject

pdf_reader = PdfReader('file1.pdf')
pdf_writer = PdfWriter()
page_obj = pdf_reader.pages[0]  # first page
crop_box = page_obj.cropbox
new_crop_box = RectangleObject((
    float(crop_box.left) + 100,  # move the left edge 100 points to the right
    float(crop_box.bottom),
    float(crop_box.right),
    float(crop_box.top),
))
page_obj.cropbox = new_crop_box
pdf_writer.add_page(page_obj)
with open('cropped_pdf.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we first open the PDF file using PdfReader. Then, we get the first page from pdf_reader.pages and read its crop box from the cropbox attribute.
We create a new crop box from four coordinates (left, bottom, right, top), increasing the left value by 100 points so that a strip is cut off the left side. Finally, we assign the new box to the page's cropbox attribute and add the page to the PdfWriter object before writing it to a file.
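The crop itself is simple coordinate arithmetic on a box given as (left, bottom, right, top) in points. The following sketch isolates that arithmetic on plain tuples so it can be checked without a PDF file. The function name crop_from_left is hypothetical, and the check that the crop amount is smaller than the page width is an added safeguard, not something the original code performs.

```python
def crop_from_left(box, amount):
    """Return a new (left, bottom, right, top) box with `amount` points
    removed from the left side.

    Raises ValueError if the crop would remove the whole width or more.
    """
    left, bottom, right, top = box
    if not 0 <= amount < right - left:
        raise ValueError(f'cannot crop {amount} points from a box {right - left} wide')
    return (left + amount, bottom, right, top)


a4_box = (0, 0, 595, 842)
cropped = crop_from_left(a4_box, 100)
print(cropped)
```

The four resulting numbers are what would be passed on when building the new crop box for the page.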
Conclusion
In this article, we’ve learned how to concatenate and merge PDF files using PyPDF2. We’ve also covered how to rotate and crop PDF pages to achieve a better layout.
By mastering these techniques, PDF files become much easier to manage, and we can quickly automate our workflow using Python. However, keep in mind that PyPDF2 has limitations in handling complex PDF files, such as those with encryption or embedded fonts.
Therefore, it’s essential to test the code on a sample of your PDF files before scaling it up to ensure its compatibility.
Encrypting and Decrypting PDF Files
PDF files often contain sensitive information that requires protection and encryption. PyPDF2 enables encryption and decryption of PDF files with its PdfWriter.encrypt() and PdfReader.decrypt() methods, respectively.
In this section, we will explore how to use these methods to encrypt and decrypt PDF files with PyPDF2.
Encrypting PDFs with PdfWriter.encrypt()
To encrypt a PDF file, we can use the PdfWriter.encrypt() method.
Here’s how we can encrypt a PDF file with a password:
from PyPDF2 import PdfReader, PdfWriter

pdf_reader = PdfReader('my_file.pdf')
pdf_writer = PdfWriter()
for page_obj in pdf_reader.pages:
    pdf_writer.add_page(page_obj)
pdf_writer.encrypt('user', 'owner', use_128bit=True)  # user password, then owner password
with open('encrypted_pdf.pdf', 'wb') as f:
    pdf_writer.write(f)
In the code snippet above, we first read the PDF file using PdfReader and copy its pages into a PdfWriter object using the add_page() method. Then, we encrypt the output using the encrypt() method, with use_128bit=True selecting 128-bit rather than 40-bit encryption.
We provide a user password 'user' and an owner password 'owner' to control the access rights of the PDF. Finally, we write the encrypted PDF to a new file using the write() method.
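The literal passwords in the example are for illustration; in a real script it is usually safer to keep them out of the source code. One common option, sketched here with the standard library, is to read them from environment variables. The helper name password_from_env and the variable name PDF_USER_PASSWORD are assumptions made for this example, and the demonstration sets a placeholder value only so that the code runs on its own.

```python
import os


def password_from_env(name):
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is missing or empty, so the
    script never falls back to encrypting with a blank password.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'environment variable {name} is not set')
    return value


# Demonstration only: provide a placeholder if the variable is not already set.
os.environ.setdefault('PDF_USER_PASSWORD', 'example-only')
user_password = password_from_env('PDF_USER_PASSWORD')
print(len(user_password) > 0)
```

The returned string could then be supplied wherever the example above uses a hard-coded password.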
Decrypting PDFs with PdfReader.decrypt()
To decrypt and read an encrypted PDF file, we can use the PdfReader.decrypt() method. Here’s how we can decrypt a PDF file:
from PyPDF2 import PdfReader

pdf_reader = PdfReader('encrypted_pdf.pdf')
if pdf_reader.is_encrypted:
    pdf_reader.decrypt('owner')  # either the user or the owner password works
for page_obj in pdf_reader.pages:
    print(page_obj.extract_text())
In the code snippet above, we first read the encrypted PDF file using PdfReader and check whether it is encrypted using the is_encrypted attribute. Then, we decrypt the PDF file by passing the password to the decrypt() method.
Finally, we loop over pdf_reader.pages and use the extract_text() method to read each page's text.
Creating PDF Files with Python and ReportLab
ReportLab is a powerful third-party library that enables the creation of PDF files using Python. With ReportLab, you can generate PDF files with complex layouts, including text, charts, images, and tables.
In this section, we will explore how