Adventures in Machine Learning

Mastering Pandas DataFrame: Manipulation, Aggregation, and Grouping

Pandas is a popular open-source data analysis library for Python. It provides flexible and powerful tools for data manipulation, analysis, and visualization.

In this article, we will dive into two important topics in Pandas: adding rows to a DataFrame and creating and viewing a DataFrame.

Adding Rows to Pandas DataFrame

Let’s start with adding rows to a DataFrame. Sometimes, we might want to add one or more rows to a DataFrame after its creation.

We cover two cases: adding a single row and adding several rows at once.

Adding One Row

To add one row to a DataFrame, we can assign to the df.loc indexer, using the length of the DataFrame’s index as the new row label. Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})
# add a row
df.loc[len(df.index)] = ['Sarah', 28, 'Female']
# print the updated DataFrame
print(df)

Output:

    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female

Here, we have created a DataFrame with three rows and added a new row with Sarah’s information using the df.loc indexer. Because the default index runs from 0 to 2, len(df.index) evaluates to 3, the first unused label, so the assignment creates a new row instead of overwriting an existing one.
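One caveat is worth illustrating: len(df.index) is only an unused label when the index is the default 0, 1, 2, ... sequence. The sketch below assumes a DataFrame whose first row has been dropped (the variable names trimmed, overwritten, and appended are illustrative) and shows that the same assignment then replaces an existing row, while taking the largest label plus one appends as intended for an integer index.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# Dropping row 0 leaves the index labels 1 and 2
trimmed = df.drop(0)

# len(...) is 2 here, which is an existing label, so Bob's row is overwritten
overwritten = trimmed.copy()
overwritten.loc[len(overwritten.index)] = ['Sarah', 28, 'Female']

# For an integer index, the largest label plus one is always unused
appended = trimmed.copy()
appended.loc[appended.index.max() + 1] = ['Sarah', 28, 'Female']

print(overwritten)
print(appended)
```

If the row labels carry no meaning, calling reset_index(drop=True) first restores the default index and makes the len(df.index) idiom safe again.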

Adding Several Rows

If we need to add multiple rows to a DataFrame, we can use the pd.concat() function. (Older tutorials use the df.append() method, which was deprecated in pandas 1.4 and removed in pandas 2.0.) We create a new DataFrame with the rows we want to add and concatenate it with the original DataFrame, passing the ignore_index argument.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})
# create a new DataFrame with additional rows
new_data = pd.DataFrame({'Name': ['Sarah', 'David'],
                          'Age': [28, 27],
                          'Gender': ['Female', 'Male']})
# concatenate the new DataFrame onto the original DataFrame
df = pd.concat([df, new_data], ignore_index=True)
# print the updated DataFrame
print(df)

Output:

    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female
4  David   27    Male

Here, we have created a new DataFrame new_data with two additional rows and concatenated it with the original DataFrame using pd.concat(). The ignore_index=True argument discards the existing index labels and renumbers the rows of the result from 0 to 4; without it, the two new rows would keep their own labels 0 and 1.
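When rows arrive one at a time, growing the DataFrame inside a loop copies the data on every step. A common alternative, sketched here with made-up incoming records and illustrative variable names, is to collect the rows in a plain Python list and combine them with the original DataFrame in a single pd.concat() call.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# Hypothetical incoming records, collected in a plain Python list first
new_rows = []
for name, age, gender in [('Sarah', 28, 'Female'), ('David', 27, 'Male')]:
    new_rows.append({'Name': name, 'Age': age, 'Gender': gender})

# Build one DataFrame from the list and concatenate a single time
combined = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
print(combined)
```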

DataFrame Creation and View

Now, let’s move on to creating and viewing a DataFrame.

Creating a DataFrame is a basic step in any data analysis project.

We can create a DataFrame from various data sources such as CSV, Excel, SQL, or by directly initializing a DataFrame using Python lists, dictionaries, or arrays.

Creating a DataFrame

To create a DataFrame directly in Python, we can call the pd.DataFrame() constructor with a dictionary of lists. The dictionary keys become the column names, and the values supply the data in each column.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})
# print the DataFrame
print(df)

Output:

   Name  Age  Gender
0  John   25    Male
1  Emma   30  Female
2   Bob   35    Male

Here, we have initialized a DataFrame using a dictionary and printed it using the print() function. Because we did not specify an index, Pandas assigns a default integer index starting at 0.
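A dictionary of lists is not the only in-memory input the constructor accepts. As a small sketch (the variable names from_rows and from_records are illustrative), the same table can be built from a list of rows plus explicit column names, or from a list of dictionaries with one dictionary per record.

```python
import pandas as pd

columns = ['Name', 'Age', 'Gender']

# A list of rows: each inner list is one record, so column names are passed separately
from_rows = pd.DataFrame([['John', 25, 'Male'],
                          ['Emma', 30, 'Female'],
                          ['Bob', 35, 'Male']], columns=columns)

# A list of dictionaries: each dictionary is one record keyed by column name
from_records = pd.DataFrame([{'Name': 'John', 'Age': 25, 'Gender': 'Male'},
                             {'Name': 'Emma', 'Age': 30, 'Gender': 'Female'},
                             {'Name': 'Bob', 'Age': 35, 'Gender': 'Male'}])

# Compare the two results cell by cell
print(from_rows.equals(from_records))
```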

Viewing a DataFrame

To view a DataFrame, we can use the df.head() and df.tail() methods. By default, df.head() returns the first five rows of the DataFrame, while df.tail() returns the last five rows.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David', 'Katie'],
                   'Age': [25, 30, 35, 28, 27, 33],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']})
# view the first five rows of the DataFrame
print(df.head())

Output:

    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female
4  David   27    Male

Here, we have created a DataFrame with six rows and used the df.head() method to print the first five rows of the DataFrame. Similarly, we can use the df.tail() method to print the last five rows of the DataFrame.
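Both methods also accept the number of rows to return, which helps when five is too many or too few. A brief sketch using the same six-row DataFrame (first_two and last_three are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David', 'Katie'],
                   'Age': [25, 30, 35, 28, 27, 33],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']})

# Pass the number of rows you want instead of relying on the default of five
first_two = df.head(2)
last_three = df.tail(3)

print(first_two)
print(last_three)
```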

Conclusion

In this article, we have covered two important topics in Pandas: adding rows to a DataFrame and creating and viewing a DataFrame. These are basic concepts necessary for any data analysis project using Pandas.

Pandas is an extremely powerful data analysis library with many more features and capabilities, which we have not covered in this article. With practice and further exploration, one can become proficient in utilizing Pandas for advanced data analysis and visualization.

DataFrame Selection and Filtering

DataFrames are arguably Pandas’ most used feature, so it’s crucial to be well-versed in selecting and filtering their contents. Here, we discuss two critical concepts: selecting specific columns and filtering rows.

Selecting Columns

Selecting columns returns only the columns you name and leaves out the rest. You can select a single column by putting its name in square brackets after the DataFrame, or several columns by passing a list of names.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah'],
   'Age': [23, 28, 35, 27],
   'Gender': ['Male', 'Female', 'Male', 'Female'],
   'Salary': [50000, 75000, 100000, 85000]
})
# Selecting specific columns
print(df[['Name', 'Salary']])

Output:

    Name  Salary
0   John   50000
1   Emma   75000
2    Bob  100000
3  Sarah   85000

Here, we’ve created a DataFrame and then selected only the ‘Name’ and ‘Salary’ columns by passing a list of column names inside the square brackets: df[['Name', 'Salary']]. Selecting columns this way returns a new DataFrame containing only those columns; the original DataFrame is unchanged.
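The number of brackets matters. As a quick sketch (salary_series and salary_frame are illustrative names), a single column name returns a Series, while a list of names returns a DataFrame even when the list holds only one name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Single brackets with one name return a one-dimensional Series
salary_series = df['Salary']

# Double brackets (a list of names) return a DataFrame, even for one column
salary_frame = df[['Salary']]

print(type(salary_series).__name__)
print(type(salary_frame).__name__)
```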

Filtering Rows

Filtering rows in a Pandas DataFrame means selecting the rows that satisfy some criterion. To filter rows, we write a Boolean expression that is True for the rows we want and use it to index the DataFrame.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah'],
   'Age': [23, 28, 35, 27],
   'Gender': ['Male', 'Female', 'Male', 'Female'],
   'Salary': [50000, 75000, 100000, 85000]
})
# Filtering rows
print(df[df['Salary'] > 80000])

Output:

    Name  Age  Gender  Salary
2    Bob   35    Male  100000
3  Sarah   27  Female   85000

Here, we’ve filtered the original DataFrame using the condition df['Salary'] > 80000, which returns all the rows where the salary is higher than $80,000. To filter on multiple conditions, wrap each condition in parentheses and combine them with the element-wise operators & (and) and | (or).
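Combining conditions can be sketched as follows. The thresholds and variable names are made up for illustration; the parentheses are required because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Both conditions must hold: salary above 70000 AND gender is Female
well_paid_women = df[(df['Salary'] > 70000) & (df['Gender'] == 'Female')]

# Either condition may hold: younger than 25 OR salary of at least 100000
young_or_top = df[(df['Age'] < 25) | (df['Salary'] >= 100000)]

print(well_paid_women)
print(young_or_top)
```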

DataFrame Manipulation

DataFrames can be updated or modified with advanced manipulations using Pandas. Here, we discuss two critical concepts: updating rows and deleting rows.

Updating Rows

We may need to update an existing row of a DataFrame, for example to correct a wrong value. This can be done by assigning through the df.loc indexer.

Here’s an example:

import pandas as pd
# Create DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah'],
   'Age': [23, 28, 35, 27],
   'Gender': ['Male', 'Female', 'Male', 'Female'],
   'Salary': [50000, 75000, 100000, 85000]
})
# Updating rows
df.loc[1, 'Salary'] = 70000
print(df)

Output:

    Name  Age  Gender  Salary
0   John   23    Male   50000
1   Emma   28  Female   70000
2    Bob   35    Male  100000
3  Sarah   27  Female   85000

Here, we have selected the row with index label 1 and updated its Salary value to 70,000.
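The same indexer can update several rows at once when given a Boolean mask instead of a single label. In this sketch the rule (a flat 5,000 raise for salaries under 80,000) is a made-up example, and mask is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# True for the rows that should change (hypothetical rule: salary under 80000)
mask = df['Salary'] < 80000

# Update only the Salary column of the matching rows
df.loc[mask, 'Salary'] = df.loc[mask, 'Salary'] + 5000

print(df)
```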

Deleting Rows

We may need to modify an existing DataFrame by deleting a selected row. This is often useful when a specific value is irrelevant, outdated, or wrong.

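DataFrame rows can be removed with the df.drop() method. By default it returns a new DataFrame without the dropped rows and leaves the original unchanged; with inplace=True it modifies the original DataFrame and returns None. Let’s understand this with an example: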

import pandas as pd
# create a DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David'],
   'Age': [23, 28, 35, 27, 30],
   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
   'Salary': [50000, 75000, 100000, 85000, 65000]
})
# Deleting rows
df.drop(4, inplace=True)
print(df)

Output:

    Name  Age  Gender  Salary
0   John   23    Male   50000
1   Emma   28  Female   75000
2    Bob   35    Male  100000
3  Sarah   27  Female   85000

Here, using the df.drop() method with an inplace=True argument, we have deleted the row with an index of 4.
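For comparison, here is a sketch of drop() without inplace=True, plus one way to drop rows by condition. The 60,000 threshold and the variable names are illustrative; reset_index(drop=True) is used only to renumber the remaining rows.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David'],
    'Age': [23, 28, 35, 27, 30],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 75000, 100000, 85000, 65000]
})

# Without inplace=True, drop() returns a new DataFrame and leaves df untouched
without_emma = df.drop(1)

# Drop every row matching a condition by passing the matching index labels
low_salary_labels = df[df['Salary'] < 60000].index
trimmed = df.drop(low_salary_labels).reset_index(drop=True)

print(without_emma)
print(trimmed)
```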

Conclusion

In conclusion, Pandas is a highly flexible, powerful, and popular data analysis library. The techniques covered here are only the tip of the iceberg of its capabilities.

By learning these Pandas DataFrame selection, filtering, and manipulation techniques, users can harness Pandas’ full potential in their data analysis projects.

DataFrames Aggregation and Grouping

In data analysis, we often need to summarize data to gain insights. Pandas offers many built-in aggregation functions and grouping methods to accomplish this task.

In this article, we’ll discuss two critical concepts in DataFrame aggregation and grouping techniques: aggregating data and grouping data.

Aggregating Data

Aggregating data is essentially the process of summarizing data. We can use methods such as sum(), mean(), max(), and min() to perform these operations.

Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah'],
   'Age': [23, 28, 35, 27],
   'Gender': ['Male', 'Female', 'Male', 'Female'],
   'Salary': [50000, 75000, 100000, 85000]
})
# Aggregating data
print("Total Salary: ", df['Salary'].sum())
print("Average Age: ", df['Age'].mean())
print("Maximum Salary: ", df['Salary'].max())
print("Minimum Age: ", df['Age'].min())

Output:

Total Salary:  310000
Average Age:  28.25
Maximum Salary:  100000
Minimum Age:  23

Here, we have created a DataFrame and used Pandas’ built-in aggregation methods to calculate the total salary, average age, maximum salary, and minimum age.
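Several statistics for several columns can also be computed in one call by passing a list of function names to agg(). A minimal sketch on the same data (summary is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Restrict to the numeric columns, then apply each named function to every column
summary = df[['Age', 'Salary']].agg(['min', 'max', 'mean'])

# Rows are the statistics, columns are the original column names
print(summary)
```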

Grouping Data

Grouping data is a powerful way to segment data and then apply various functions to those groups. Here’s an example:

import pandas as pd
# create a DataFrame
df = pd.DataFrame({
   'Name': ['John', 'Emma', 'Bob', 'Sarah', 'Tom', 'Jessica'],
   'Age': [23, 28, 35, 27, 31, 29],
   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
   'Salary': [50000, 75000, 100000, 85000, 92000, 80000]
})
# grouping data
grouped = df.groupby('Gender')
print(grouped['Salary'].mean())

Output:

Gender
Female    80000.000000
Male      80666.666667
Name: Salary, dtype: float64

Here, we have created a DataFrame and then grouped it based on the ‘Gender’ column using the groupby() method. After grouping based on ‘Gender’, we have used the mean() function to calculate the average salary of both males and females.

We can also use the agg method to compute multiple values for every group simultaneously:

grouped = df.groupby('Gender')
print(grouped['Salary'].agg(['count', 'mean', 'min', 'max']))

Output:

        count          mean    min     max
Gender                                    
Female      3  80000.000000  75000   85000
Male        3  80666.666667  50000  100000

Here, we have used the agg() method to calculate the count, mean, min, and max salary for males and females.
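When the summary should cover more than one column, named aggregation lets each output column specify its own source column and function. This sketch assumes pandas 0.25 or later, where the keyword form of agg() is available; the output names headcount, avg_salary, and oldest are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'Tom', 'Jessica'],
    'Age': [23, 28, 35, 27, 31, 29],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000, 92000, 80000]
})

# Each keyword is an output column: (source column, aggregation function)
summary = df.groupby('Gender').agg(
    headcount=('Name', 'count'),
    avg_salary=('Salary', 'mean'),
    oldest=('Age', 'max'),
)

print(summary)
```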

Conclusion

In conclusion, Pandas offers several in-built methods for data aggregation and grouping. Users can achieve meaningful insights into datasets by utilizing these powerful features in Pandas.

DataFrame aggregation and grouping are essential steps toward unlocking the full potential of Pandas for impactful data analysis. More broadly, Pandas is an essential library for data manipulation and analysis in Python, and the DataFrame is one of its central features.

This article has discussed important concepts in working with DataFrames, including creating and viewing DataFrames, adding rows, selecting columns, filtering rows, updating and deleting rows, and aggregating and grouping data using Pandas. These are strongly recommended practices for anyone seeking to master data analysis with Pandas.

By the end of this article, it should be apparent that the ease of use and power of these tools make them valuable for any data analysis project. Embracing these techniques in Pandas will help users to extract insights from raw data.
