Adventures in Machine Learning

Efficiently manipulating SQL data with Pandas read_sql() function

Pandas is a powerful open-source library for data manipulation, analysis, and visualization in Python. It provides easy-to-use functions to read and write data from various sources, including SQL databases like SQLite.

In this article, we will explore the Pandas read_sql() function, which allows us to read data from an SQL database into a Pandas DataFrame. We will cover the prerequisites you need to know before using this function, its syntax, and how to implement it.

Prerequisites

Before using the Pandas read_sql() function, you should be familiar with Python programming, the Pandas library, SQLAlchemy, and SQLite.

  • Python is a popular high-level programming language used for many tasks, including data manipulation and analysis.
  • Pandas is a Python library that provides functions and data structures for data manipulation and analysis.
  • SQLAlchemy is a Python library that provides a high-level API for connecting to various SQL databases.
  • SQLite is a lightweight SQL database engine commonly used for small-scale applications.

Syntax of Pandas read_sql()

The syntax for Pandas read_sql() function is as follows:

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
  • sql: A string containing an SQL query to execute or a table name to read from.
  • con: An SQLAlchemy connectable (engine or connection), a database URI string, or a sqlite3 connection.
  • index_col: A column name or a list of column names to set as the index of the DataFrame.
  • coerce_float: A boolean indicating whether to convert non-string, non-numeric values (such as decimal.Decimal) to float. Default is True.
  • params: A list, tuple, or dictionary of values to pass as parameters for a parameterized query.
  • parse_dates: A column name or a list of column names to parse as datetime values.
  • columns: A list of column names to select. This is applied only when sql is a table name, not a query; with a query, select the columns in the SQL itself. Default is to select all columns.
  • chunksize: An integer specifying the number of rows to read from the database at a time; read_sql then returns an iterator of DataFrames. Default is None, which reads all rows at once.
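The chunksize parameter in particular changes the return type: instead of one DataFrame, read_sql() yields an iterator of DataFrames. A minimal, self-contained sketch using an in-memory SQLite database (the ‘events’ table and its columns are invented for this illustration):

```python
import sqlite3

import pandas as pd

# Build a small throwaway table in an in-memory database.  The
# 'events' table and its columns are invented for this illustration.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO events (label) VALUES (?)",
                 [('a',), ('b',), ('c',), ('d',), ('e',)])
conn.commit()

# With chunksize set, read_sql returns an iterator of DataFrames
# rather than a single DataFrame, so a large table never has to fit
# in memory all at once.
sizes = [len(chunk) for chunk in
         pd.read_sql("SELECT * FROM events", conn, chunksize=2)]
print(sizes)  # [2, 2, 1]
```

Iterating in chunks like this is the usual way to process tables that are too large to load in one piece.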

Implementing Pandas read_sql()

To demonstrate how to use the Pandas read_sql() function, we will first create a database and a table using SQLite and then insert data into the table.

Creating a database and table

import sqlite3
# Create a new SQLite database
database = 'students.db'
conn = sqlite3.connect(database)
# Create a new table for students
table = 'students'
query = f"""
CREATE TABLE {table} (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    age INTEGER NOT NULL,
    gender TEXT NOT NULL
)
"""
conn.execute(query)

In the above code, we create a new SQLite database named ‘students.db’ and connect to it. Then we create a new table named ‘students’ with four columns: ‘id’, ‘name’, ‘age’, and ‘gender’ using the CREATE TABLE SQL statement. The ‘id’ column is the primary key of the table and is automatically incremented for each new row added to the table.

Inserting data into the table

# Insert some data into the table
data = [
    ('John', 21, 'Male'),
    ('Alice', 19, 'Female'),
    ('Bob', 22, 'Male'),
    ('Eve', 20, 'Female')
]
query = f"""
INSERT INTO {table} (name, age, gender)
VALUES (?, ?, ?)
"""
conn.executemany(query, data)
conn.commit()

In the above code, we define a list of tuples containing the data to be inserted into the ‘students’ table. Then we use the INSERT INTO SQL statement with placeholders (?) to insert the data. The executemany() method runs the INSERT statement once for each tuple in the list, and commit() saves the changes to the database.

Reading data from the table

import pandas as pd
# Read data from the table into a DataFrame
query = f"""
SELECT *
FROM {table}
"""
df = pd.read_sql(query, conn)
# Print the DataFrame
print(df)

In the above code, we define an SQL query to select all columns and rows from the ‘students’ table using the SELECT SQL statement. Then we use the read_sql() function to execute the query and read the data into a Pandas DataFrame. Finally, we print the DataFrame to the console.

Output:

   id   name  age  gender
0   1   John   21    Male
1   2  Alice   19  Female
2   3    Bob   22    Male
3   4    Eve   20  Female
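It is also worth checking the column types pandas inferred from the SQL result. A self-contained sketch, rebuilding the same students table in an in-memory database so it runs on its own:

```python
import sqlite3

import pandas as pd

# Rebuild the students table in an in-memory database so this
# sketch is self-contained.
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL, age INTEGER NOT NULL, gender TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO students (name, age, gender) VALUES (?, ?, ?)",
    [('John', 21, 'Male'), ('Alice', 19, 'Female'),
     ('Bob', 22, 'Male'), ('Eve', 20, 'Female')])
conn.commit()

df = pd.read_sql("SELECT * FROM students", conn)
# SQLite INTEGER columns are typically inferred as int64 and TEXT
# columns as object.
print(df.dtypes)
```

Knowing the inferred dtypes up front avoids surprises later when filtering or aggregating the DataFrame.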

Conclusion

In this article, we have learned about the Pandas read_sql() function, which allows us to read data from an SQL database into a Pandas DataFrame. We have covered the prerequisites, such as Python programming, Pandas library, SQLAlchemy, and SQLite, that are needed to use this function. We have also discussed the syntax of the read_sql() function and how to implement it. By creating a SQLite database, creating a table, inserting data, and reading data into a Pandas DataFrame, we have demonstrated how to use the read_sql() function.

Advanced functionalities of Pandas read_sql()

In addition to reading data from an SQL database into a Pandas DataFrame, the Pandas read_sql() function also provides advanced functionalities that allow us to customize the output. In this section, we will cover how to add an index column to the DataFrame, specify columns for the output, and filter rows using SQL condition clauses.

Adding an index column to the DataFrame

By default, when we read data from an SQL table using the Pandas read_sql() function, pandas assigns the DataFrame a plain integer index (0, 1, 2, …) rather than using any table column. We can instead set a column as the index of the DataFrame using the index_col parameter.

# Read data from the table into a DataFrame with an index column
query = f"""
SELECT *
FROM {table}
"""
df = pd.read_sql(query, conn, index_col='id')
# Print the DataFrame
print(df)

In the above code, we specify the ‘id’ column of the ‘students’ table to be used as the index of the DataFrame by passing index_col=’id’ as a parameter to the read_sql() function. Now, the DataFrame has an index column, and we can use it to select, filter, or merge data.

Output:

     name  age  gender
id                  
1    John   21    Male
2   Alice   19  Female
3     Bob   22    Male
4     Eve   20  Female
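With ‘id’ as the index, rows can be looked up by label with .loc. A short sketch, rebuilding the same table in an in-memory database so it runs on its own:

```python
import sqlite3

import pandas as pd

# Rebuild the students table in an in-memory database.
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL, age INTEGER NOT NULL, gender TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO students (name, age, gender) VALUES (?, ?, ?)",
    [('John', 21, 'Male'), ('Alice', 19, 'Female'),
     ('Bob', 22, 'Male'), ('Eve', 20, 'Female')])
conn.commit()

df = pd.read_sql("SELECT * FROM students", conn, index_col='id')
# .loc selects by index label, so students can be looked up by id.
print(df.loc[2, 'name'])              # Alice
print(df.loc[[1, 3], 'age'].tolist())  # [21, 22]
```

This is one of the main practical benefits of setting index_col: the database key becomes a first-class lookup key in pandas.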

Specifying columns for the output

Sometimes, we may not want to include all columns of an SQL table in the DataFrame. The columns parameter accepts a list of column names, but it takes effect only when the sql argument is a table name; when sql is a query string, pandas ignores it. With a query, we simply select the desired columns in the SQL itself.

# Read selected columns from the table into a DataFrame
query = f"""
SELECT name, age
FROM {table}
"""
df = pd.read_sql(query, conn)
# Print the DataFrame
print(df)

In the above code, we define an SQL query that selects only the ‘name’ and ‘age’ columns from the ‘students’ table using the SELECT SQL statement, so the resulting DataFrame contains just those two columns.

Output:

    name  age
0   John   21
1  Alice   19
2    Bob   22
3    Eve   20
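For completeness, the columns parameter itself is honoured when a table name, rather than a query, is passed as the first argument; read_sql() then delegates to read_sql_table(), which requires an SQLAlchemy connectable. A hedged sketch, assuming SQLAlchemy is installed:

```python
import pandas as pd
from sqlalchemy import create_engine

# For an in-memory SQLite database, SQLAlchemy keeps one connection
# per thread, so the table survives across separate calls.
engine = create_engine('sqlite:///:memory:')
with engine.begin() as con:
    con.exec_driver_sql("""CREATE TABLE students (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL, age INTEGER NOT NULL, gender TEXT NOT NULL)""")
    con.exec_driver_sql(
        "INSERT INTO students (name, age, gender) VALUES "
        "('John', 21, 'Male'), ('Alice', 19, 'Female')")

# Passing the table name (not a query) makes read_sql delegate to
# read_sql_table, where the columns parameter is applied.
df = pd.read_sql('students', engine, columns=['name', 'age'])
print(list(df.columns))  # ['name', 'age']
```

Either approach works; selecting columns in the SQL keeps the logic in one place, while the columns parameter is convenient when reading whole tables by name.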

Filtering rows using SQL condition clauses

Often, we need to filter the rows of an SQL table based on some criteria. We can add a WHERE clause to the SQL query to specify the filtering condition; the condition is simply part of the query string we pass to the read_sql() function.

# Read data from the table into a DataFrame with filtered rows
query = f"""
SELECT *
FROM {table}
WHERE age > 20
"""
df = pd.read_sql(query, conn)
# Print the DataFrame
print(df)

In the above code, we define an SQL query to select all columns and rows from the ‘students’ table where the ‘age’ column is greater than 20 using the SELECT and WHERE SQL clauses. Then we pass the query string to the read_sql() function. Now, the DataFrame only contains the rows that satisfy the condition.

Output:

   id  name  age gender
0   1  John   21   Male
1   3   Bob   22   Male
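When the filter value comes from user input, interpolating it directly into the query string is risky; the params argument passes it through the driver's placeholder mechanism instead. A self-contained sketch using SQLite's ? placeholder:

```python
import sqlite3

import pandas as pd

# Rebuild the students table in an in-memory database.
conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE students (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL, age INTEGER NOT NULL, gender TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO students (name, age, gender) VALUES (?, ?, ?)",
    [('John', 21, 'Male'), ('Alice', 19, 'Female'),
     ('Bob', 22, 'Male'), ('Eve', 20, 'Female')])
conn.commit()

# Passing the threshold through params keeps user-supplied values
# out of the SQL string, avoiding SQL injection.
df = pd.read_sql("SELECT * FROM students WHERE age > ?", conn,
                 params=(20,))
print(df['name'].tolist())  # ['John', 'Bob']
```

The placeholder syntax depends on the driver (SQLite uses ?, other databases may use %s or named parameters), but the params argument works the same way.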

We can use other comparison operators like <, <=, >, >=, and !=, and logical operators like AND and OR to create more complex conditions for filtering the rows.

# Read data from the table into a DataFrame with a compound condition
query = f"""
SELECT *
FROM {table}
WHERE gender = 'Female' AND age >= 20
"""
df = pd.read_sql(query, conn)
# Print the DataFrame
print(df)

In the above code, we define an SQL query to select all rows from the ‘students’ table where the ‘gender’ column is ‘Female’ and the ‘age’ column is at least 20, combining the two conditions with the AND operator. Now, the DataFrame only contains the rows that satisfy both conditions.

Output:

   id  name  age  gender
0   4   Eve   20  Female

Conclusion

In this section, we covered advanced functionalities of the Pandas read_sql() function: adding an index column to the DataFrame, specifying columns for the output, and filtering rows with SQL condition clauses. Together with the basics covered earlier (the prerequisites, the syntax, and a complete SQLite walkthrough), these features make read_sql() a powerful tool for reading data from an SQL database into a Pandas DataFrame and customizing the result to fit the task at hand.
