Adventures in Machine Learning

Efficiently Slice and Dice Your Data: Splitting a Pandas DataFrame into Multiple DataFrames

Splitting a Pandas DataFrame into Multiple DataFrames

Have you ever found yourself dealing with a massive dataset that seems impossible to manage? Perhaps you only want to work with a specific section of the data that is relevant to your analysis or task.

In this article, we explore how to split a Pandas DataFrame into multiple DataFrames, making it easier to work with and analyze the data.

Splitting into Two DataFrames

Let us start with the simpler task of splitting a DataFrame into two separate DataFrames. This can come in handy when you want to separate the data based on certain criteria, such as separating data from different time periods or separating data of different types.

To split a DataFrame into two parts, we can use the following code:

```
df1 = df.iloc[:n]
df2 = df.iloc[n:]
```

Here, `n` is the row position at which we want to split the DataFrame. Because `iloc` is position-based, `df1` will contain the rows from the beginning up to position `n-1`, while `df2` will contain the rows from position `n` up to the end of the DataFrame.

For example, let us create a simple DataFrame with some random data:

```
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': np.random.rand(5)
})

print(df)
```

Output:

```
   A  B         C
0  1  a  0.144214
1  2  b  0.782830
2  3  c  0.206456
3  4  d  0.808581
4  5  e  0.839471
```

Now let us split this DataFrame into two parts, with the first two rows in one DataFrame and the remaining rows in another:

```
df1 = df.iloc[:2]
df2 = df.iloc[2:]

print(df1)
print(df2)
```

Output:

```
   A  B         C
0  1  a  0.144214
1  2  b  0.782830
   A  B         C
2  3  c  0.206456
3  4  d  0.808581
4  5  e  0.839471
```

As you can see, we have effectively split the original DataFrame into two separate DataFrames, `df1` and `df2`, based on the index value `n=2`.
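The same two-way split can also be driven by a condition on the data rather than a row position, which is closer to the "certain criteria" case mentioned earlier. The following is a minimal sketch, assuming a small illustrative DataFrame; the names `mask`, `df_low`, and `df_high` and the threshold are hypothetical choices, not part of the example above.

```python
import pandas as pd

# Illustrative data; the columns and values are assumptions for this sketch.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["a", "b", "c", "d", "e"],
})

# Boolean mask: True for rows that belong in the first DataFrame.
mask = df["A"] <= 2

df_low = df[mask]     # rows where the condition holds
df_high = df[~mask]   # all remaining rows

print(df_low)
print(df_high)
```

Every row lands in exactly one of the two DataFrames, because the second selection uses the negated mask.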

Splitting into Multiple DataFrames

If you need to split a DataFrame into more than two parts, you can use the `numpy.array_split` function. It splits an array (or a DataFrame) into a given number of pieces along an axis, rows by default. Unlike `numpy.split`, it does not require the pieces to be equal in size.

The basic syntax for this function is as follows:

```
parts = np.array_split(df, num)
```

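Here, `num` is the number of parts to split the DataFrame into. The resulting object `parts` will be a list of DataFrames. The parts are as close to equal in size as possible: when the number of rows does not divide evenly by `num`, the earlier parts each receive one extra row.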

For example, let us create another DataFrame with some random data:

```
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
    'C': np.random.rand(10),
    'D': np.random.randn(10)
})

print(df)
```

Output:

```
    A  B         C         D
0   1  a  0.469512  0.399789
1   2  b  0.224657 -0.000644
2   3  c  0.943547 -0.626233
3   4  d  0.839175  1.154726
4   5  e  0.522775 -0.188685
5   6  f  0.992727 -1.007110
6   7  g  0.354648  0.292593
7   8  h  0.541027 -0.143968
8   9  i  0.811682  1.316059
9  10  j  0.047342 -1.311120
```

Now let us split this DataFrame into three parts. Because 10 rows do not divide evenly by 3, the first part receives four rows and the other two receive three rows each:

```
parts = np.array_split(df, 3)

print(parts[0])
print(parts[1])
print(parts[2])
```

Output:

```
   A  B         C         D
0  1  a  0.469512  0.399789
1  2  b  0.224657 -0.000644
2  3  c  0.943547 -0.626233
3  4  d  0.839175  1.154726
   A  B         C         D
4  5  e  0.522775 -0.188685
5  6  f  0.992727 -1.007110
6  7  g  0.354648  0.292593
    A  B         C         D
7   8  h  0.541027 -0.143968
8   9  i  0.811682  1.316059
9  10  j  0.047342 -1.311120
```

As you can see, we have now split the original DataFrame into three separate DataFrames: the first with four rows and the other two with three rows each.
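To see how the part sizes come out, it can help to split the row positions rather than the DataFrame itself and then select each group with `iloc`. The sketch below assumes a hypothetical helper named `split_rows` and a one-column illustrative DataFrame. Some pandas versions have been reported to emit a deprecation warning when a DataFrame is passed directly to `np.array_split`, so this variant may also be useful in that situation.

```python
import numpy as np
import pandas as pd

# Illustrative 10-row DataFrame (an assumption for this sketch).
df = pd.DataFrame({"A": range(1, 11)})


def split_rows(frame, num):
    """Split frame row-wise into num parts of near-equal size (hypothetical helper)."""
    # Split the integer row positions, not the DataFrame itself.
    position_groups = np.array_split(np.arange(len(frame)), num)
    # Select each group of positions with positional indexing.
    return [frame.iloc[positions] for positions in position_groups]


parts = split_rows(df, 3)
print([len(part) for part in parts])  # [4, 3, 3]
```

With 10 rows and 3 parts, the single leftover row goes to the first part, which is the same sizing rule `np.array_split` applies to plain arrays.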

Viewing Resulting DataFrames

It is always a good idea to inspect the resulting DataFrames to confirm that the split did what you expected. One way to do this is with the `head()` and `tail()` methods.

The `head()` method returns the first `n` rows of the DataFrame, while the `tail()` method returns the last `n` rows. By default, `n=5`.

For example, let us view the first two rows of `df1` and `df2` from the earlier example:

```
print(df1.head(2))
print(df2.head(2))
```

Output:

```
   A  B         C
0  1  a  0.144214
1  2  b  0.782830
   A  B         C
2  3  c  0.206456
3  4  d  0.808581
```

As you can see, `df1` contains the first two rows of the original DataFrame. `df2` holds the remaining three rows, but `head(2)` shows only the first two of them.
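Beyond looking at a few rows, a quick programmatic check is to compare shapes and confirm that the pieces recombine into the original. This is a minimal sketch under the assumption of a small illustrative DataFrame; the name `rebuilt` is hypothetical.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch).
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": list("abcde")})

df1 = df.iloc[:2]
df2 = df.iloc[2:]

# The row counts of the pieces should add up to the original row count.
print(df1.shape, df2.shape)

# Concatenating the pieces in order should reproduce the original exactly.
rebuilt = pd.concat([df1, df2])
print(rebuilt.equals(df))
```

If `equals` returns `False`, a row was dropped, duplicated, or reordered somewhere in the split.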

Conclusion

In this article, we explored how to split a Pandas DataFrame into multiple DataFrames based on certain criteria. We learned how to split a DataFrame into two parts using indexing, and how to use the `numpy.array_split` function to split a DataFrame into multiple parts.

By breaking down larger datasets into smaller, more manageable DataFrames, we can make our data analysis and manipulation tasks more efficient and accurate.

Example 2: Split Pandas DataFrame into Multiple DataFrames

In this example, we will create a DataFrame with some student records and split it into multiple DataFrames based on the department of each student.

Step 1: Creating a DataFrame

Let us create a DataFrame with some student records. We will include columns for the student ID, name, age, GPA, and department.

```
import pandas as pd

records = {
    'ID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Fiona', 'George', 'Hannah'],
    'Age': [19, 20, 19, 20, 21, 22, 20, 21],
    'GPA': [3.2, 3.5, 3.9, 3.4, 3.7, 3.8, 3.1, 3.6],
    'Department': ['CS', 'EC', 'CS', 'EC', 'CS', 'EC', 'CS', 'EC']
}

df = pd.DataFrame(records)

print(df)
```

Output:

```
     ID     Name  Age  GPA  Department
0  1001    Alice   19  3.2          CS
1  1002      Bob   20  3.5          EC
2  1003  Charlie   19  3.9          CS
3  1004    David   20  3.4          EC
4  1005     Emma   21  3.7          CS
5  1006    Fiona   22  3.8          EC
6  1007   George   20  3.1          CS
7  1008   Hannah   21  3.6          EC
```

Step 2: Splitting into Multiple DataFrames

Now let us split this DataFrame into multiple DataFrames based on the department of each student. We will create a dictionary to hold each department’s DataFrame.

```
depts = {}

for dept in df['Department'].unique():
    # Keep only the rows for this department; copy so each piece is independent
    depts[dept] = df[df['Department'] == dept].copy()

print(depts)
```

Output:

```
{
'CS':
     ID     Name  Age  GPA  Department
0  1001    Alice   19  3.2          CS
2  1003  Charlie   19  3.9          CS
4  1005     Emma   21  3.7          CS
6  1007   George   20  3.1          CS,
'EC':
     ID    Name  Age  GPA  Department
1  1002     Bob   20  3.5          EC
3  1004   David   20  3.4          EC
5  1006   Fiona   22  3.8          EC
7  1008  Hannah   21  3.6          EC
}
```

As you can see, we have created two DataFrames, one for each department. Each DataFrame contains only the records of students in that department.

Step 3: Viewing Resulting DataFrames

To check whether the split was successful, let us view the resulting DataFrames using the `head()` method.

```
for dept, df_dept in depts.items():
    # Leading newline separates each department's block in the output
    print(f"\nStudents in Department: {dept}")
    print(df_dept.head())
```

Output:

```

Students in Department: CS
     ID     Name  Age  GPA  Department
0  1001    Alice   19  3.2          CS
2  1003  Charlie   19  3.9          CS
4  1005     Emma   21  3.7          CS
6  1007   George   20  3.1          CS

Students in Department: EC
     ID    Name  Age  GPA  Department
1  1002     Bob   20  3.5          EC
3  1004   David   20  3.4          EC
5  1006   Fiona   22  3.8          EC
7  1008  Hannah   21  3.6          EC
```

As you can see, each DataFrame contains only the records of students in their respective departments.
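The same per-department dictionary can be built without an explicit filter for each value by iterating over `groupby()`. The sketch below is one possible alternative, not a replacement for the loop above; it uses a shortened, illustrative version of the student data.

```python
import pandas as pd

# Shortened illustrative data (an assumption for this sketch).
df = pd.DataFrame({
    "ID": [1001, 1002, 1003, 1004],
    "Department": ["CS", "EC", "CS", "EC"],
})

# Iterating over a groupby yields (group key, sub-DataFrame) pairs.
depts = {dept: group for dept, group in df.groupby("Department")}

print(list(depts))
print(depts["CS"])
```

Each row is assigned to its group in a single pass, which tends to matter more as the number of distinct departments grows.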

Related Tutorials: Performing Common Functions with Pandas

To make data analysis and manipulation easier, Pandas provides a wide range of functions to perform common operations on DataFrames. Here are some of the most commonly used functions:

1. Reading Data from CSV Files: The `read_csv()` function reads data from a CSV file into a DataFrame.

2. Filtering Data: The `loc[]` and `iloc[]` indexers select rows and columns by label and by position, respectively.

3. Sorting Data: The `sort_values()` method sorts a DataFrame by one or more columns.

4. Grouping Data: The `groupby()` method groups the rows of a DataFrame by one or more columns.

5. Merging DataFrames: The `merge()` function joins two DataFrames on shared columns or indexes.

6. Pivot Tables: The `pivot_table()` function creates a pivot table from a DataFrame.

7. Plotting Data: The `plot()` method creates plots and graphs from a DataFrame.

By mastering these common functions, you can perform complex data analysis and manipulation tasks with ease using Pandas.
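A few of the operations listed above can be sketched together on a small example. The data, the `buildings` lookup table, and all variable names here are hypothetical and only meant to show the call shapes.

```python
import pandas as pd

# Illustrative student data (an assumption for this sketch).
students = pd.DataFrame({
    "ID": [1001, 1002, 1003, 1004],
    "GPA": [3.2, 3.5, 3.9, 3.4],
    "Department": ["CS", "EC", "CS", "EC"],
})

# Hypothetical lookup table used to demonstrate a merge.
buildings = pd.DataFrame({
    "Department": ["CS", "EC"],
    "Building": ["North", "South"],
})

high_gpa = students.loc[students["GPA"] >= 3.5]          # filter rows by a condition
ranked = students.sort_values("GPA", ascending=False)     # sort by a column
mean_gpa = students.groupby("Department")["GPA"].mean()   # aggregate per group
merged = students.merge(buildings, on="Department")       # join on a shared key

print(high_gpa)
print(ranked)
print(mean_gpa)
print(merged)
```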

In this article, we explored how to split a Pandas DataFrame into multiple DataFrames based on specific criteria. We discussed two examples of splitting a DataFrame, one into two separate DataFrames and another into multiple DataFrames.

By breaking down large data sets into smaller, more manageable DataFrames, we can make data analysis and manipulation tasks more efficient and accurate. We also touched on some common functions of Pandas, such as filtering, sorting, grouping, merging, pivot tables, and plotting, that can simplify complex data analysis and manipulation tasks.

By mastering these techniques and functions, you can become an expert in handling large datasets, making your analysis and tasks more efficient.
