Splitting a Pandas DataFrame into Multiple DataFrames
Have you ever found yourself dealing with a massive dataset that seems impossible to manage? Perhaps you only want to work with a specific section of the data that is relevant to your analysis or task.
In this article, we explore how to split a Pandas DataFrame into multiple DataFrames, making it easier to work with and analyze the data.
Splitting into Two DataFrames
Let us start with the simpler task of splitting a DataFrame into two separate DataFrames by row position. This comes in handy when the rows are already in a meaningful order, for example when data from an earlier time period should be separated from data from a later one.
To split a DataFrame into two parts, we can use the following code:
df1 = df.iloc[:n]
df2 = df.iloc[n:]
Here n is the row position at which we want to split the DataFrame. df1 will contain the rows from the beginning up to position n-1, while df2 will contain the rows from position n up to the end of the DataFrame.
For example, let us create a simple DataFrame with some random data:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': np.random.rand(5)
})
print(df)
Output:
A B C
0 1 a 0.144214
1 2 b 0.782830
2 3 c 0.206456
3 4 d 0.808581
4 5 e 0.839471
Now let us split this DataFrame into two parts, with the first two rows in one DataFrame and the remaining rows in another:
df1 = df.iloc[:2]
df2 = df.iloc[2:]
print(df1)
print(df2)
Output:
A B C
0 1 a 0.144214
1 2 b 0.782830
A B C
2 3 c 0.206456
3 4 d 0.808581
4 5 e 0.839471
As you can see, we have split the original DataFrame into two separate DataFrames, df1 and df2, at position n=2.
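One detail worth noting is that each piece keeps the row labels of the original DataFrame, so the second piece starts at label 2 rather than 0. A minimal sketch, assuming a renumbered index is wanted; the small sample frame and the name df2_reset are illustrative and not part of the example above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e']})

n = 2
df1 = df.iloc[:n]   # rows at positions 0 .. n-1
df2 = df.iloc[n:]   # rows at positions n .. end

# The second piece keeps the original labels 2, 3, 4.
print(list(df2.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels.
df2_reset = df2.reset_index(drop=True)
print(list(df2_reset.index))
```

Whether to reset the index depends on the task: keeping the original labels makes it easy to trace rows back to the source DataFrame.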
Splitting into Multiple DataFrames
If you need to split a DataFrame into multiple parts, you can use the numpy.array_split function. This function splits arrays or DataFrames into multiple sub-arrays or sub-DataFrames along a specified axis.
The basic syntax for this function is as follows:
parts = np.array_split(df, num)
Here num is the number of parts to split the DataFrame into. The resulting object parts will be a list of DataFrames, each containing a consecutive block of rows from the original DataFrame.
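When the number of rows is not evenly divisible by num, array_split does not raise an error; instead, the leading parts each receive one extra element. The sketch below illustrates this on a plain NumPy array of ten elements, chosen here only to keep the example small:

```python
import numpy as np

values = np.arange(10)              # ten elements: 0 .. 9
parts = np.array_split(values, 3)   # three parts of unequal length

sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]
```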
For example, let us create another DataFrame with some random data:
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
    'C': np.random.rand(10),
    'D': np.random.randn(10)
})
print(df)
Output:
A B C D
0 1 a 0.469512 0.399789
1 2 b 0.224657 -0.000644
2 3 c 0.943547 -0.626233
3 4 d 0.839175 1.154726
4 5 e 0.522775 -0.188685
5 6 f 0.992727 -1.007110
6 7 g 0.354648 0.292593
7 8 h 0.541027 -0.143968
8 9 i 0.811682 1.316059
9 10 j 0.047342 -1.311120
Now let us split this DataFrame into three parts. Because ten rows cannot be divided evenly by three, the first part receives four rows and the other two receive three rows each:
parts = np.array_split(df, 3)
print(parts[0])
print(parts[1])
print(parts[2])
Output:
A B C D
0 1 a 0.469512 0.399789
1 2 b 0.224657 -0.000644
2 3 c 0.943547 -0.626233
3 4 d 0.839175 1.154726
A B C D
4 5 e 0.522775 -0.188685
5 6 f 0.992727 -1.007110
6 7 g 0.354648 0.292593
A B C D
7 8 h 0.541027 -0.143968
8 9 i 0.811682 1.316059
9 10 j 0.047342 -1.311120
As you can see, we have now split the original DataFrame into three separate DataFrames containing four, three, and three rows respectively.
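If the goal is a fixed number of rows per piece rather than a fixed number of pieces, positional slicing in a loop is one option. A minimal sketch, where the helper name split_by_size and the parameter chunk_size are hypothetical and introduced only for illustration:

```python
import pandas as pd

def split_by_size(frame, chunk_size):
    """Return consecutive slices of frame with at most chunk_size rows each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frame.iloc[start:start + chunk_size]
            for start in range(0, len(frame), chunk_size)]

df = pd.DataFrame({'A': range(1, 11)})   # ten rows
chunks = split_by_size(df, 3)

# Every chunk has three rows except the last, which holds the remainder.
print([len(c) for c in chunks])  # [3, 3, 3, 1]
```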
Viewing Resulting DataFrames
It is always a good idea to inspect the resulting DataFrames to ensure that the split worked as intended. One way to do this is with the head() and tail() methods.
The head() method returns the first n rows of the DataFrame, while the tail() method returns the last n rows. By default, n=5.
For example, let us view the first two rows of df1 and df2 from the earlier example:
print(df1.head(2))
print(df2.head(2))
Output:
A B C
0 1 a 0.144214
1 2 b 0.782830
A B C
2 3 c 0.206456
3 4 d 0.808581
As you can see, df1 contains the first two rows of the original DataFrame, while df2 begins at the third row. Because we called head(2), only the first two of the three rows in df2 are shown.
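Beyond looking at the first rows, a split can also be checked programmatically. One way, sketched here with a small illustrative frame, is to compare row counts and confirm that concatenating the pieces reproduces the original:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': ['a', 'b', 'c', 'd', 'e']})
df1 = df.iloc[:2]
df2 = df.iloc[2:]

# The row counts of the pieces should add up to the original row count.
print(df1.shape, df2.shape)  # (2, 2) (3, 2)

# Concatenating the pieces in order should give back the original data.
rebuilt = pd.concat([df1, df2])
print(rebuilt.equals(df))  # True
```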
Conclusion
In this article, we explored how to split a Pandas DataFrame into multiple DataFrames. We learned how to split a DataFrame into two parts using positional indexing, and how to use the numpy.array_split function to split a DataFrame into multiple parts.
Breaking a larger dataset down into smaller DataFrames makes it easier to inspect and to process one piece at a time.
Example 2: Split Pandas DataFrame into Multiple DataFrames
In this example, we will create a DataFrame with some student records and split it into multiple DataFrames based on the department of each student.
Step 1: Creating a DataFrame
Let us create a DataFrame with some student records. We will include columns for the student ID, name, age, GPA, and department.
import pandas as pd
records = {
    'ID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Fiona', 'George', 'Hannah'],
    'Age': [19, 20, 19, 20, 21, 22, 20, 21],
    'GPA': [3.2, 3.5, 3.9, 3.4, 3.7, 3.8, 3.1, 3.6],
    'Department': ['CS', 'EC', 'CS', 'EC', 'CS', 'EC', 'CS', 'EC']
}
df = pd.DataFrame(records)
print(df)
Output:
ID Name Age GPA Department
0 1001 Alice 19 3.2 CS
1 1002 Bob 20 3.5 EC
2 1003 Charlie 19 3.9 CS
3 1004 David 20 3.4 EC
4 1005 Emma 21 3.7 CS
5 1006 Fiona 22 3.8 EC
6 1007 George 20 3.1 CS
7 1008 Hannah 21 3.6 EC
Step 2: Splitting into Multiple DataFrames
Now let us split this DataFrame into multiple DataFrames based on the department of each student. We will create a dictionary to hold each department’s DataFrame.
depts = {}
for dept in df.Department.unique():
    # Select the rows for this department and keep an independent copy
    depts[dept] = df[df.Department == dept].copy()
print(depts)
Output:
{
'CS':
ID Name Age GPA Department
0 1001 Alice 19 3.2 CS
2 1003 Charlie 19 3.9 CS
4 1005 Emma 21 3.7 CS
6 1007 George 20 3.1 CS,
'EC':
ID Name Age GPA Department
1 1002 Bob 20 3.5 EC
3 1004 David 20 3.4 EC
5 1006 Fiona 22 3.8 EC
7 1008 Hannah 21 3.6 EC
}
As you can see, we have created two DataFrames, one for each department. Each DataFrame contains only the records of students in that department.
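The same per-department dictionary can also be built with groupby(), which yields one (key, sub-DataFrame) pair per distinct value. This is offered as an alternative sketch rather than a replacement for the loop above, and it uses a shortened version of the student table to stay compact:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1001, 1002, 1003, 1004],
    'Department': ['CS', 'EC', 'CS', 'EC'],
})

# Iterating over a groupby object gives (group key, sub-DataFrame) pairs.
depts = {dept: group for dept, group in df.groupby('Department')}

print(sorted(depts))               # ['CS', 'EC']
print(depts['CS']['ID'].tolist())  # [1001, 1003]
```

With groupby() the DataFrame is scanned once, whereas the loop applies a separate boolean filter for every department; for a handful of groups either approach is fine.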
Step 3: Viewing Resulting DataFrames
To check that the split was successful, let us view the resulting DataFrames using the head() method.
for dept, df_dept in depts.items():
    print(f"\nStudents in Department: {dept}")
    print(df_dept.head())
Output:
Students in Department: CS
ID Name Age GPA Department
0 1001 Alice 19 3.2 CS
2 1003 Charlie 19 3.9 CS
4 1005 Emma 21 3.7 CS
6 1007 George 20 3.1 CS
Students in Department: EC
ID Name Age GPA Department
1 1002 Bob 20 3.5 EC
3 1004 David 20 3.4 EC
5 1006 Fiona 22 3.8 EC
7 1008 Hannah 21 3.6 EC
As you can see, each DataFrame contains only the records of students in their respective departments.
Related Tutorials: Performing Common Functions with Pandas
To make data analysis and manipulation easier, Pandas provides a wide range of functions to perform common operations on DataFrames. Here are some of the most commonly used functions:
- Reading Data from CSV Files: The read_csv() function is used to read data from CSV files into a DataFrame.
- Filtering Data: The loc[] and iloc[] indexers are used to select rows and columns by label and by position, respectively.
- Sorting Data: The sort_values() method is used to sort DataFrames based on one or more columns.
- Grouping Data: The groupby() method is used to group DataFrames based on one or more columns.
- Merging DataFrames: The merge() function is used to combine two DataFrames on shared columns or indices.
- Pivot Tables: The pivot_table() function is used to create pivot tables from DataFrames.
- Plotting Data: The plot() method is used to create plots and graphs from DataFrames.
By mastering these common functions, you can perform complex data analysis and manipulation tasks with ease using Pandas.
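To make a few of these operations concrete, here is a short sketch that filters, sorts, groups, and merges. The students table is a shortened version of the one used earlier; the buildings table and its values are made up purely for illustration:

```python
import pandas as pd

students = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'GPA': [3.2, 3.5, 3.9, 3.4],
    'Department': ['CS', 'EC', 'CS', 'EC'],
})
# Hypothetical lookup table, not part of the original examples.
buildings = pd.DataFrame({
    'Department': ['CS', 'EC'],
    'Building': ['North Hall', 'East Hall'],
})

# Filtering: keep rows whose GPA is at least 3.4.
top = students.loc[students['GPA'] >= 3.4]

# Sorting: highest GPA first.
ranked = students.sort_values('GPA', ascending=False)

# Grouping: mean GPA per department.
mean_gpa = students.groupby('Department')['GPA'].mean()

# Merging: attach the building of each student's department.
merged = students.merge(buildings, on='Department')

print(top, ranked, mean_gpa, merged, sep='\n\n')
```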
In this article, we explored how to split a Pandas DataFrame into multiple DataFrames. We discussed two examples: splitting a DataFrame into two parts by row position, and splitting it into one DataFrame per department.
Breaking large datasets down into smaller DataFrames makes them easier to inspect and to process one piece at a time. We also touched on some common Pandas operations, such as filtering, sorting, grouping, merging, pivot tables, and plotting, that can simplify more complex data analysis and manipulation tasks.