Adventures in Machine Learning

Efficiently Manipulating Excel Files in Python with Pandas and Openpyxl

Upgrading Pandas and Using Openpyxl to Read Excel Files

Data analysis is an integral part of any scientific research, business, or industry. One of the most popular libraries in Python for data analysis is Pandas.

Pandas is an open-source library that provides powerful data structuring tools, data visualization, and data manipulation. However, there are times when the data is present in Excel files, and Pandas starts to struggle.

In such cases, it is best to use a specialized library like Openpyxl. This article covers how to upgrade Pandas, install Openpyxl, and use it to read Excel files.

Upgrading Pandas and Installing Openpyxl

Before we dive into the details of working with Openpyxl, let’s discuss how to upgrade Pandas and install Openpyxl. Upgrading to the latest version of Pandas is always recommended.

You can upgrade Pandas using pip, which is a package manager for Python. To upgrade Pandas using pip, run the following command in your terminal:

!pip install --upgrade pandas

Once you have upgraded Pandas, you can install Openpyxl.

You can install Openpyxl using pip as follows:

!pip install openpyxl

Using Openpyxl to Read Excel Files

Now that you have upgraded Pandas and installed Openpyxl, let’s see how to use Openpyxl to read Excel files. Openpyxl provides a module called “load_workbook” that we can use to load Excel files.

Once we load an Excel file, we can access its sheets using the “sheetnames” attribute and manipulate the data as required. Here’s an example of how to use Openpyxl to read an Excel file:

import openpyxl
# Load the Excel file
workbook = openpyxl.load_workbook("example.xlsx")
# Get sheetnames
sheet_names = workbook.sheetnames
# Loop through each sheetname
for sheet in sheet_names:
    # Get the sheet by name
    worksheet = workbook[sheet]
    # Loop through each row in the sheet
    for row in worksheet.iter_rows():
        # Loop through each cell in the row
        for cell in row:
            # Print the value of the cell
            print(cell.value)

As you can see, we loaded the example.xlsx file using the “load_workbook” module and accessed sheets using the “sheetnames” attribute. We then looped through each sheetname, accessed the sheet using its name using “workbook[sheet],” and iterated through each row and cell in the sheet.

Downgrading to version 1.X of xlrd (NOT RECOMMENDED)

If you are still using an old version of Pandas (Pandas < 1.2), you might have encountered the issue of not being able to read .xls files using Pandas. The issue occurs because Pandas relied on the xlrd library to read .xls files, and the newer version of xlrd removed support for .xls files.

If you do not want to upgrade to a newer version of Pandas, you can downgrade xlrd to version 1.x. However, this is not recommended as version 1.x is vulnerable to various security issues. Reading a .xls File

If you have an old .xls file that you need to read, you can use the xlrd library to read it.

xlrd is a third-party library used for reading Excel files (both .xls and .xlsx).

To use xlrd to read a .xls file, you can use the following code:

import xlrd
# Open the .xls file
workbook = xlrd.open_workbook("example.xls")
# Get sheetnames
sheet_names = workbook.sheet_names()
# Loop through each sheetname
for sheet in sheet_names:
    # Get the sheet by name
    worksheet = workbook.sheet_by_name(sheet)
    # Loop through each row in the sheet
    for row in range(worksheet.nrows):
        # Loop through each cell in the row
        for col in range(worksheet.ncols):
            # Get the value of the cell
            value = worksheet.cell(row, col).value
            # Print the value of the cell
            print(value)

The code above first opens the .xls file using the “open_workbook” method. It then gets the sheetnames using the “sheet_names” method, which returns a list of sheet names.

We then loop through each sheetname and access the sheet using the “sheet_by_name” method. We then loop through each row and cell in the sheet using the “nrows” and “ncols” attributes and read the value of each cell using the “cell” method.

Converting a .xls File to .xlsx

If you have an old .xls file that you need to use in Pandas, it is best to convert it to .xlsx. .xlsx is the newer format for Excel files and has better support for data analysis libraries like Pandas.

To convert a .xls file to .xlsx, you can use the following code in Python:

import pandas as pd
# Read the .xls file using xlrd
df = pd.read_excel("example.xls", engine="xlrd")
# Save the DataFrame to .xlsx
df.to_excel("example.xlsx", index=False)

The code above uses Pandas to read the .xls file using the “read_excel” method. It then saves the DataFrame to .xlsx using the “to_excel” method.

Setting the index parameter to False ensures that the index column is not saved to the .xlsx file.

Conclusion

In conclusion, upgrading Pandas and using Openpyxl to read Excel files is an efficient and effective way to manipulate data present in Excel files. It is always recommended to upgrade Pandas to prevent any issues and vulnerabilities.

Although downgrading xlrd to version 1.x is an option to read old .xls files, it is not recommended. Converting a .xls file to .xlsx is the best practice to use it in Pandas with better results.

So go ahead and use the best tools suited for the job and start exploring your data present in Excel files.

File Extension and Libraries

When it comes to working with Excel files in Python, there are two primary file extensions: .xlsx and .xls. .xlsx is the newer Excel file format while .xls is the older one.

These two file extensions require different libraries to read them in Python. Using Openpyxl for .xlsx Files

Openpyxl is a third-party library used to read and write Excel files in Python.

It is the preferred library for reading .xlsx files in Python. Openpyxl provides a simple and efficient way of reading and writing data to and from Excel files.

It is easy to learn and use, making it an excellent choice for beginners. To use Openpyxl to read an .xlsx file, you can use the following code:

import openpyxl
# Load the Excel file
workbook = openpyxl.load_workbook("example.xlsx")
# Get sheetnames
sheet_names = workbook.sheetnames
# Loop through each sheetname
for sheet in sheet_names:
    # Get the sheet by name
    worksheet = workbook[sheet]
    # Loop through each row in the sheet
    for row in worksheet.iter_rows():
        # Loop through each cell in the row
        for cell in row:
            # Print the value of the cell
            print(cell.value)

The code above loads the “example.xlsx” file using the “load_workbook” method and accesses sheets using the “sheetnames” attribute. We loop through each sheetname and access the sheet using its name using “workbook[sheet].” We then loop through each row and cell in the sheet using the “iter_rows” method and read the value of each cell using the “value” attribute.

Using Xlrd for .xls Files

Xlrd is a third-party library used to read .xls files in Python. Xlrd provides essential tools to read data from Excel files, making it an excellent choice for working with older Excel files.

However, it is not recommended for use with .xlsx files due to changes in the Excel file format. To use xlrd to read a .xls file, you can use the following code:

import xlrd
# Open the .xls file
workbook = xlrd.open_workbook("example.xls")
# Get sheetnames
sheet_names = workbook.sheet_names()
# Loop through each sheetname
for sheet in sheet_names:
    # Get the sheet by name
    worksheet = workbook.sheet_by_name(sheet)
    # Loop through each row in the sheet
    for row in range(worksheet.nrows):
        # Loop through each cell in the row
        for col in range(worksheet.ncols):
            # Get the value of the cell
            value = worksheet.cell(row, col).value
            # Print the value of the cell
            print(value)

The code above uses the “open_workbook” method to open the “example.xls” file and accesses sheets using the “sheet_names” method. We loop through each sheetname and access the sheet using the “sheet_by_name” method.

We then loop through each row and cell in the sheet using the “nrows” and “ncols” attributes and read the value of each cell using the “cell” method.

Checking the File Extension

Before reading an Excel file in Python, it is essential to check its file extension to determine which library to use. You can check the file extension using Python’s built-in “os” module as follows:

import os
# Get the file extension
file_extension = os.path.splitext("example.xlsx")[1]
# Check if the file extension is .xlsx
if file_extension == ".xlsx":
    # Use Openpyxl
    print("Reading Excel file using Openpyxl")
    # code to read .xlsx file using Openpyxl
else:
    # Use xlrd
    print("Reading Excel file using xlrd")
    # code to read .xls file using xlrd

The code above uses the “splitext” method from the “os” module to extract the file extension from the filename. We then check if the file extension is “.xlsx” using an “if” statement.

If the file extension is “.xlsx,” we use Openpyxl to read the file. Otherwise, we use xlrd to read the file.

Error Troubleshooting

When working with Excel files in Python, you may encounter errors such as “ImportError: No module named ‘xlrd'” or “UnsupportedFormatWarning: Unsupported file format, or corrupt file: Expected BOF record; found b’rn

General Fixes

  • Ensure that the required libraries such as Pandas, Openpyxl, and xlrd are installed.
  • Verify that the Excel file is not corrupt.
  • Check for any typos and syntax errors in your code.
  • Ensure that you have the appropriate permissions to read and write to the Excel file.

Checking for Pandas and Xlrd Versions

If you get an “ImportError: No module named ‘xlrd'” error, you may need to update your xlrd library or install the required version of xlrd that works with your version of Pandas. To check your Pandas and xlrd versions, you can use the following code:

import pandas as pd
import xlrd
print("Pandas version:", pd.__version__)
print("xlrd version:", xlrd.__VERSION__)

Setting the Engine Keyword Argument to Openpyxl

If you encounter an “UnsupportedFormatWarning: Unsupported file format, or corrupt file” error while using Pandas to read an Excel file, you can set the “engine” keyword argument to “openpyxl” to read the file using Openpyxl. Here’s an example:

import pandas as pd
# Read the .xlsx file using Pandas and Openpyxl engine
df = pd.read_excel("example.xlsx", engine="openpyxl")
# Print the DataFrame
print(df)

The code above sets the “engine” keyword argument to “openpyxl” to read the “example.xlsx” file using Openpyxl. The Pandas read_excel method uses the “engine” keyword argument to read Excel files using different libraries.

In conclusion, working with Excel files in Python involves using the right library for the file extension, troubleshooting errors, and checking for library versions. By following these best practices, you can efficiently work with Excel files in Python and perform data analysis tasks quickly and easily.

In conclusion, working with Excel files in Python requires the use of specific libraries depending on the file extension. Openpyxl is recommended for .xlsx files, while xlrd is used for .xls files.

It is essential to check the file extension before reading it and be mindful of version compatibility between libraries. Troubleshooting errors requires checking for outdated or corrupted files, using the correct syntax, and setting the engine keyword argument to openpyxl.

Excel files are a prevalent source of data, especially in business and scientific fields, and learning how to read and manipulate them in Python is essential. By following these guidelines, you can efficiently work with Excel files in Python and perform data analysis tasks quickly and easily.

Popular Posts