Adventures in Machine Learning

Efficiently Slice and Dice Your Data: Splitting a Pandas DataFrame into Multiple DataFrames

Splitting a Pandas DataFrame into Multiple DataFrames

Have you ever found yourself dealing with a massive dataset that seems impossible to manage? Perhaps you only want to work with a specific section of the data that is relevant to your analysis or task.

In this article, we explore how to split a Pandas DataFrame into multiple DataFrames, making it easier to work with and analyze the data.

Splitting into Two DataFrames

Let us start with the simpler task of splitting a DataFrame into two separate DataFrames by row position. This comes in handy when the rows are already ordered in a meaningful way, for example when separating data from an earlier time period from data from a later one.

To split a DataFrame into two parts, we can use the following code:

df1 = df.iloc[:n]
df2 = df.iloc[n:]

Here n is the row position at which we want to split the DataFrame. df1 will contain the rows at positions 0 through n-1, while df2 will contain the rows from position n up to the end of the DataFrame.

For example, let us create a simple DataFrame with some random data:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
    'C': np.random.rand(5)
})

print(df)

Output:

   A  B         C
0  1  a  0.144214
1  2  b  0.782830
2  3  c  0.206456
3  4  d  0.808581
4  5  e  0.839471

Now let us split this DataFrame into two parts, with the first two rows in one DataFrame and the remaining rows in another:

df1 = df.iloc[:2]
df2 = df.iloc[2:]

print(df1)
print(df2)

Output:

   A  B         C
0  1  a  0.144214
1  2  b  0.782830
   A  B         C
2  3  c  0.206456
3  4  d  0.808581
4  5  e  0.839471

As you can see, we have effectively split the original DataFrame into two separate DataFrames, df1 and df2, based on the index value n=2.
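Positional slicing works when the split point is a row number. When the split depends on a condition instead, a boolean mask is one common option. The following is a minimal sketch under that assumption; the sample data, the threshold, and the names mask, df_low, and df_high are illustrative and not part of the example above.

```python
import pandas as pd

# Hypothetical sample data, similar in shape to the example above
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
})

# Boolean mask: True for rows that satisfy the (illustrative) condition
mask = df['A'] <= 2

df_low = df[mask]     # rows where A <= 2
df_high = df[~mask]   # all remaining rows

print(df_low)
print(df_high)
```

Because the two pieces are defined by a mask and its negation, every row lands in exactly one of them.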

Splitting into Multiple DataFrames

If you need to split a DataFrame into multiple parts, you can use the numpy.array_split function. This function splits arrays or DataFrames into multiple sub-arrays or sub-DataFrames along a specified axis.

The basic syntax for this function is as follows:

parts = np.array_split(df, num)

Here num is the number of parts to split the DataFrame into. The resulting object parts will be a list of DataFrames, each containing a consecutive block of rows from the original DataFrame. When the rows do not divide evenly, the first parts receive one extra row each.

For example, let us create another DataFrame with some random data:

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
    'C': np.random.rand(10),
    'D': np.random.randn(10)
})

print(df)

Output:

    A  B         C         D
0   1  a  0.469512  0.399789
1   2  b  0.224657 -0.000644
2   3  c  0.943547 -0.626233
3   4  d  0.839175  1.154726
4   5  e  0.522775 -0.188685
5   6  f  0.992727 -1.007110
6   7  g  0.354648  0.292593
7   8  h  0.541027 -0.143968
8   9  i  0.811682  1.316059
9  10  j  0.047342 -1.311120

Now let us split this DataFrame into three parts. Because 10 rows do not divide evenly by 3, the first part receives four rows and the other two receive three rows each:

parts = np.array_split(df, 3)
print(parts[0])
print(parts[1])
print(parts[2])

Output:

   A  B         C         D
0  1  a  0.469512  0.399789
1  2  b  0.224657 -0.000644
2  3  c  0.943547 -0.626233
3  4  d  0.839175  1.154726
   A  B         C         D
4  5  e  0.522775 -0.188685
5  6  f  0.992727 -1.007110
6  7  g  0.354648  0.292593
    A  B         C         D
7   8  h  0.541027 -0.143968
8   9  i  0.811682  1.316059
9  10  j  0.047342 -1.311120

As you can see, we have now split the original DataFrame into three separate DataFrames containing four, three, and three rows.
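To see how the part sizes come out when the row count is not a multiple of the number of parts, here is a minimal sketch. It applies np.array_split to the row positions rather than to the DataFrame itself and then selects each group with iloc. This is a swapped-in variant, chosen because some recent pandas releases emit a deprecation warning when a DataFrame is passed to np.array_split directly. The name position_groups is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11)})  # 10 rows, like the example above

# Split the row positions 0..9 into 3 nearly equal groups, then select
# each group of rows positionally with iloc.
position_groups = np.array_split(np.arange(len(df)), 3)
parts = [df.iloc[pos] for pos in position_groups]

sizes = [len(part) for part in parts]
print(sizes)  # [4, 3, 3]
```

The extra row goes to the first part, so the sizes never differ by more than one.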

Viewing Resulting DataFrames

It is always a good idea to inspect the resulting DataFrames to ensure that the splitting process was successful. One way to do this is by using the head() and tail() methods.

The head() method displays the first n rows of the DataFrame, while the tail() method displays the last n rows. By default, n=5.

For example, let us view the first two rows of df1 and df2 from the earlier example:

print(df1.head(2))
print(df2.head(2))

Output:

   A  B         C
0  1  a  0.144214
1  2  b  0.782830
   A  B         C
2  3  c  0.206456
3  4  d  0.808581

As you can see, df1 contains the first two rows of the original DataFrame, while head(2) shows the first two of the three remaining rows held in df2.
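Beyond looking at the first rows, a split can also be checked programmatically. The sketch below is one possible check, not the only one: it compares shapes, uses tail() to look at the end of a piece, and confirms that concatenating the pieces reproduces the original. The variable rebuilt is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['a', 'b', 'c', 'd', 'e'],
})
df1 = df.iloc[:2]
df2 = df.iloc[2:]

# Row counts of the pieces should add up to the original row count
print(df1.shape, df2.shape)

# tail() shows the end of a piece, complementing head()
print(df2.tail(1))

# Stitching the pieces back together should reproduce the original
rebuilt = pd.concat([df1, df2])
print(rebuilt.equals(df))
```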

Conclusion

In this article, we explored how to split a Pandas DataFrame into multiple DataFrames based on certain criteria. We learned how to split a DataFrame into two parts using indexing, and how to use the numpy.array_split function to split a DataFrame into multiple parts.

By breaking down larger datasets into smaller, more manageable DataFrames, we can make our data analysis and manipulation tasks more efficient and accurate.

Example 2: Split Pandas DataFrame into Multiple DataFrames

In this example, we will create a DataFrame with some student records and split it into multiple DataFrames based on the department of each student.

Step 1: Creating a DataFrame

Let us create a DataFrame with some student records. We will include columns for the student ID, name, age, GPA, and department.

import pandas as pd
records = {
    'ID': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Fiona', 'George', 'Hannah'],
    'Age': [19, 20, 19, 20, 21, 22, 20, 21],
    'GPA': [3.2, 3.5, 3.9, 3.4, 3.7, 3.8, 3.1, 3.6],
    'Department': ['CS', 'EC', 'CS', 'EC', 'CS', 'EC', 'CS', 'EC']
}
df = pd.DataFrame(records)

print(df)

Output:

     ID     Name  Age  GPA Department
0  1001    Alice   19  3.2         CS
1  1002      Bob   20  3.5         EC
2  1003  Charlie   19  3.9         CS
3  1004    David   20  3.4         EC
4  1005     Emma   21  3.7         CS
5  1006    Fiona   22  3.8         EC
6  1007   George   20  3.1         CS
7  1008   Hannah   21  3.6         EC

Step 2: Splitting into Multiple DataFrames

Now let us split this DataFrame into multiple DataFrames based on the department of each student. We will create a dictionary to hold each department’s DataFrame.

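# Map each department name to a DataFrame of its students
depts = {}
for dept in df['Department'].unique():
    # .copy() makes each piece independent of df for later modification
    depts[dept] = df[df['Department'] == dept].copy()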

print(depts)

Output (laid out here for readability):

{
    'CS': 
         ID     Name  Age  GPA Department
    0  1001    Alice   19  3.2         CS
    2  1003  Charlie   19  3.9         CS
    4  1005     Emma   21  3.7         CS
    6  1007   George   20  3.1         CS,
    
    'EC': 
         ID    Name  Age  GPA Department
    1  1002     Bob   20  3.5         EC
    3  1004   David   20  3.4         EC
    5  1006   Fiona   22  3.8         EC
    7  1008  Hannah   21  3.6         EC
}

As you can see, we have created two DataFrames, one for each department. Each DataFrame contains only the records of students in that department.

Step 3: Viewing Resulting DataFrames

To check that the splitting process was successful, let us view the resulting DataFrames using the head() function.

for dept, df_dept in depts.items():
    print(f"\nStudents in Department: {dept}")
    print(df_dept.head())

Output:

Students in Department: CS
     ID     Name  Age  GPA Department
0  1001    Alice   19  3.2         CS
2  1003  Charlie   19  3.9         CS
4  1005     Emma   21  3.7         CS
6  1007   George   20  3.1         CS

Students in Department: EC
     ID    Name  Age  GPA Department
1  1002     Bob   20  3.5         EC
3  1004   David   20  3.4         EC
5  1006   Fiona   22  3.8         EC
7  1008  Hannah   21  3.6         EC

As you can see, each DataFrame contains only the records of students in their respective departments.
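The same per-department split can also be written with groupby(), since iterating over a grouped DataFrame yields pairs of group key and sub-DataFrame. The following is a minimal sketch of that alternative, using a shortened, hypothetical version of the student records.

```python
import pandas as pd

# Shortened, hypothetical version of the student records
df = pd.DataFrame({
    'ID': [1001, 1002, 1003, 1004],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['CS', 'EC', 'CS', 'EC'],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs
depts = {dept: group for dept, group in df.groupby('Department')}

print(list(depts))
print(depts['CS'])
```

This avoids filtering the full DataFrame once per department, which may matter when there are many distinct values.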

Related Tutorials: Performing Common Functions with Pandas

To make data analysis and manipulation easier, Pandas provides a wide range of functions to perform common operations on DataFrames. Here are some of the most commonly used functions:

  1. Reading Data from CSV Files: The read_csv() function is used to read data from CSV files into a DataFrame.
  2. Filtering Data: The loc[] and iloc[] indexers select rows and columns by label and by integer position, respectively.
  3. Sorting Data: The sort_values() function is used to sort a DataFrame by one or more columns.
  4. Grouping Data: The groupby() function is used to group the rows of a DataFrame by one or more columns.
  5. Merging DataFrames: The merge() function is used to join two DataFrames on shared columns or indexes.
  6. Pivot Tables: The pivot_table() function is used to create pivot tables from DataFrames.
  7. Plotting Data: The plot() function is used to create plots and graphs from DataFrames.

By mastering these common functions, you can perform complex data analysis and manipulation tasks with ease using Pandas.
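As a rough illustration of several of the functions listed above, the sketch below reads a small CSV, then filters, sorts, groups, and merges it. The CSV content, the buildings table, and all variable names are hypothetical and exist only for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content; read_csv also accepts a file path
csv_text = """ID,Name,GPA,Department
1001,Alice,3.2,CS
1002,Bob,3.5,EC
1003,Charlie,3.9,CS
1004,David,3.4,EC
"""
students = pd.read_csv(io.StringIO(csv_text))

# Filtering with loc: rows with GPA above 3.3, two selected columns
high_gpa = students.loc[students['GPA'] > 3.3, ['Name', 'GPA']]

# Sorting: highest GPA first
ranked = students.sort_values('GPA', ascending=False)

# Grouping: mean GPA per department
mean_gpa = students.groupby('Department')['GPA'].mean()

# Merging: attach a (hypothetical) building name to each student
buildings = pd.DataFrame({'Department': ['CS', 'EC'],
                          'Building': ['North', 'South']})
merged = students.merge(buildings, on='Department')

print(high_gpa)
print(ranked)
print(mean_gpa)
print(merged)
```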

In this article, we explored how to split a Pandas DataFrame into multiple DataFrames based on specific criteria. We discussed two examples of splitting a DataFrame, one into two separate DataFrames and another into multiple DataFrames.

By breaking down large data sets into smaller, more manageable DataFrames, we can make data analysis and manipulation tasks more efficient and accurate. We also touched on some common functions of Pandas, such as filtering, sorting, grouping, merging, pivot tables, and plotting, that can simplify complex data analysis and manipulation tasks.

By mastering these techniques and functions, you can become an expert in handling large datasets, making your analysis and tasks more efficient.
