Adventures in Machine Learning

Mastering Pandas DataFrame: Manipulation, Aggregation, and Grouping

Pandas is a popular open-source data analysis library for Python. It provides flexible and powerful tools for data manipulation, analysis, and visualization.

In this article, we start with two basic topics in Pandas: adding rows to a DataFrame, and creating and viewing a DataFrame. Later sections cover selection, filtering, manipulation, aggregation, and grouping.

Adding Rows to Pandas DataFrame

Let’s start with adding rows to a DataFrame. Sometimes, we might want to add one or more rows to a DataFrame after its creation.

There are two common cases: adding a single row and adding several rows at once.

Adding One Row

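To add one row to a DataFrame, we can assign to the `df.loc` indexer, using the length of the DataFrame’s index as the label of the new row. Note that `df.loc` is used with square brackets, not called like a method. Here’s an example: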

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# add a row
df.loc[len(df.index)] = ['Sarah', 28, 'Female']

# print the updated DataFrame
print(df)
```

Output:

```
    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female
```

Here, we have created a DataFrame with three rows and added a fourth row with Sarah’s information through the `df.loc` indexer. Because the default index runs from 0 to 2, `len(df.index)` evaluates to 3, the next unused label, so the assignment creates a new row.
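One caveat is worth illustrating: using the index length as the new label only appends when the labels are the default 0, 1, 2, and so on. The sketch below (the variable names `trimmed`, `overwritten`, and `appended` are illustrative, not from the original example) shows what can happen after a row has been dropped, and one possible way around it.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# Dropping the first row leaves the index labels 1 and 2.
trimmed = df.drop(0)

# len(...) is 2, which is an existing label, so this assignment
# overwrites Bob's row instead of adding a new one.
overwritten = trimmed.copy()
overwritten.loc[len(overwritten.index)] = ['Sarah', 28, 'Female']

# Resetting the index first restores the labels 0..n-1, so the same
# assignment now adds a row.
appended = trimmed.reset_index(drop=True)
appended.loc[len(appended.index)] = ['Sarah', 28, 'Female']

print(len(overwritten))  # 2
print(len(appended))     # 3
```

If the index may have gaps or custom labels, resetting it first (or concatenating a one-row DataFrame) is the safer choice.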

Adding Several Rows

If we need to add multiple rows to a DataFrame, we can use the `pd.concat()` function. Older tutorials use the `df.append()` method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0. We create a new DataFrame with the rows we want to add and concatenate it with the original DataFrame, passing the `ignore_index` argument.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# create a new DataFrame with additional rows
new_data = pd.DataFrame({'Name': ['Sarah', 'David'],
                         'Age': [28, 27],
                         'Gender': ['Female', 'Male']})

# concatenate the new DataFrame with the original DataFrame
df = pd.concat([df, new_data], ignore_index=True)

# print the updated DataFrame
print(df)
```

Output:

```
    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female
4  David   27    Male
```

Here, we have created a new DataFrame `new_data` with two additional rows and concatenated it with the original DataFrame using `pd.concat()`. The `ignore_index=True` argument discards the row labels of both inputs and renumbers the result from 0, so the new rows receive the labels 3 and 4 instead of repeating 0 and 1.
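When new rows arrive as individual records, it is usually cheaper to collect them first and combine them in a single step than to grow the DataFrame one row at a time. A minimal sketch, assuming the records come in as dictionaries (the `incoming` list is a hypothetical stand-in for such data):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# Hypothetical incoming records, one dictionary per row.
incoming = [
    {'Name': 'Sarah', 'Age': 28, 'Gender': 'Female'},
    {'Name': 'David', 'Age': 27, 'Gender': 'Male'},
]

# Build one DataFrame from all records and concatenate a single time.
combined = pd.concat([df, pd.DataFrame(incoming)], ignore_index=True)

print(combined)
```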

DataFrame Creation and View

Now, let’s move on to creating and viewing a DataFrame.

Creating a DataFrame is a basic step in any data analysis project.

We can create a DataFrame from various data sources such as CSV, Excel, SQL, or by directly initializing a DataFrame using Python lists, dictionaries, or arrays.

Creating a DataFrame

To create a DataFrame from Python lists, we can pass a dictionary to the `pd.DataFrame()` constructor. The dictionary keys become the column names, and the values (lists of equal length) become the data in the columns.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob'],
                   'Age': [25, 30, 35],
                   'Gender': ['Male', 'Female', 'Male']})

# print the DataFrame
print(df)
```

Output:

```
   Name  Age  Gender
0  John   25    Male
1  Emma   30  Female
2   Bob   35    Male
```

Here, we have initialized a DataFrame using a dictionary and printed it using the `print()` function. Because we did not specify an index, Pandas assigns a default integer index (0, 1, 2, ...) to the rows.
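A dictionary of lists is not the only possible input. As a rough illustration of two of the other sources mentioned earlier, the sketch below builds the same small table from a list of row dictionaries and from CSV text. The CSV content is held in an in-memory string here purely for the example; a file path passed to `pd.read_csv()` would be the more typical case.

```python
import io
import pandas as pd

# A list of dictionaries: each dictionary is one row.
records = [
    {'Name': 'John', 'Age': 25, 'Gender': 'Male'},
    {'Name': 'Emma', 'Age': 30, 'Gender': 'Female'},
]
from_records = pd.DataFrame(records)

# CSV text held in memory (illustrative data, not a real file).
csv_text = "Name,Age,Gender\nJohn,25,Male\nEmma,30,Female\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_records)
print(from_csv)
```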

Viewing a DataFrame

To view a DataFrame, we can use the `df.head()` and `df.tail()` methods. By default, `df.head()` returns the first five rows of the DataFrame and `df.tail()` returns the last five; both accept an optional row count, as in `df.head(10)`.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David', 'Katie'],
                   'Age': [25, 30, 35, 28, 27, 33],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']})

# view the first five rows of the DataFrame
print(df.head())
```

Output:

```
    Name  Age  Gender
0   John   25    Male
1   Emma   30  Female
2    Bob   35    Male
3  Sarah   28  Female
4  David   27    Male
```

Here, we have created a DataFrame with six rows and used the `df.head()` method to print the first five rows of the DataFrame. Similarly, we can use the `df.tail()` method to print the last five rows of the DataFrame.
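Both methods take an optional row count, which can be handy when five rows is too many or too few. A short sketch on the same six-row data (the names `first_two` and `last_two` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David', 'Katie'],
                   'Age': [25, 30, 35, 28, 27, 33],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female']})

first_two = df.head(2)  # rows labelled 0 and 1
last_two = df.tail(2)   # rows labelled 4 and 5

print(first_two)
print(last_two)
print(df.shape)         # (6, 3): six rows, three columns
```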

Conclusion

In this article, we have covered two important topics in Pandas: adding rows to a DataFrame and creating and viewing a DataFrame. These are basic concepts necessary for any data analysis project using Pandas.

Pandas is an extremely powerful data analysis library with many more features and capabilities, which we have not covered in this article. With practice and further exploration, one can become proficient in utilizing Pandas for advanced data analysis and visualization.

DataFrame Selection and Filtering

DataFrames are arguably Pandas’ most used feature, so it’s crucial to be well-versed in their selection and filtering methods. Here, we discuss two critical concepts: selecting specific columns and filtering rows.

Selecting Columns

To select columns, place the column name in square brackets after the DataFrame. A single name, as in `df['Name']`, returns a Series, while a list of names returns a DataFrame containing only those columns.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Selecting specific columns
print(df[['Name', 'Salary']])
```

Output:

```
    Name  Salary
0   John   50000
1   Emma   75000
2    Bob  100000
3  Sarah   85000
```

Here, we’ve created a DataFrame and then selected only the ‘Name’ and ‘Salary’ columns using `df[['Name', 'Salary']]`. Selecting with a list of column names returns a new DataFrame that contains only those columns.
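The number of brackets matters. As a small illustration (the variable names are hypothetical), single brackets with one column name give back a Series, while a list containing one name gives back a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

salary_series = df['Salary']    # single brackets: a Series
salary_frame = df[['Salary']]   # list with one name: a DataFrame

print(type(salary_series).__name__)  # Series
print(type(salary_frame).__name__)   # DataFrame
```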

Filtering Rows

Filtering rows in a Pandas DataFrame is simply selecting specific rows based on criteria. To filter rows, we use a Boolean expression to create a filter that will match the rows we require.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Filtering rows
print(df[df['Salary'] > 80000])
```

Output:

```
    Name  Age  Gender  Salary
2    Bob   35    Male  100000
3  Sarah   27  Female   85000
```

Here, we’ve filtered the original DataFrame using the condition `df['Salary'] > 80000`, which returns all the rows where the salary is higher than $80,000. To filter on multiple conditions, wrap each condition in parentheses and join them with the element-wise operators `&` (and) and `|` (or); Python’s `and` and `or` keywords do not work on a Series.
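Combining conditions could look like the following sketch. The thresholds and variable names are made up for illustration; the parentheses around each comparison are required because `&` and `|` bind more tightly than `>` and `==`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Both conditions must hold (element-wise AND).
well_paid_women = df[(df['Salary'] > 70000) & (df['Gender'] == 'Female')]

# At least one condition must hold (element-wise OR).
young_or_top = df[(df['Age'] < 25) | (df['Salary'] >= 100000)]

print(well_paid_women)
print(young_or_top)
```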

DataFrame Manipulation

DataFrames can be updated or modified with advanced manipulations using Pandas. Here, we discuss two critical concepts: updating rows and deleting rows.

Updating Rows

We may need to update an existing row of a DataFrame, for example to correct a wrong value. This can be done by assigning through the `df.loc` indexer.

Here’s an example:

```
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Updating rows
df.loc[1, 'Salary'] = 70000

# print the updated DataFrame
print(df)
```

Output:

```
    Name  Age  Gender  Salary
0   John   23    Male   50000
1   Emma   28  Female   70000
2    Bob   35    Male  100000
3  Sarah   27  Female   85000
```

Here, we have selected the row of index 1 and updated its Salary column with the value of 70,000.
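The same indexer also accepts a Boolean mask in place of a single row label, which allows several rows to be updated at once. A minimal sketch, assuming a hypothetical 5,000 raise for every row where Gender is ‘Female’:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Boolean mask selecting the rows to change.
mask = df['Gender'] == 'Female'

# Update the Salary column only for the selected rows.
df.loc[mask, 'Salary'] += 5000

print(df)
```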

Deleting Rows

We may need to modify an existing DataFrame by deleting a selected row. This is often useful when a specific value is irrelevant, outdated, or wrong.

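DataFrame rows can be removed with the `df.drop()` method. By default it returns a new DataFrame without the dropped rows and leaves the original unchanged; with `inplace=True` it modifies the original DataFrame and returns `None`. Let’s understand this with an example: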

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David'],
    'Age': [23, 28, 35, 27, 30],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 75000, 100000, 85000, 65000]
})

# Deleting rows
df.drop(4, inplace=True)

# print the updated DataFrame
print(df)
```

Output:

```
    Name  Age  Gender  Salary
0   John   23    Male   50000
1   Emma   28  Female   75000
2    Bob   35    Male  100000
3  Sarah   27  Female   85000
```

Here, using the `df.drop()` method with an `inplace=True` argument, we have deleted the row with an index of 4.
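For comparison, here is a sketch of `df.drop()` without `inplace=True`, which leaves the original DataFrame untouched, along with two common variations: dropping several labels at once and dropping the rows that match a condition. The 70,000 threshold and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'David'],
    'Age': [23, 28, 35, 27, 30],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Salary': [50000, 75000, 100000, 85000, 65000]
})

# Returns a new DataFrame; df itself still has five rows.
without_david = df.drop(4)

# Drop several rows by passing a list of index labels.
without_two = df.drop([0, 4])

# Drop the rows matching a condition by passing their index labels.
no_low_salary = df.drop(df[df['Salary'] < 70000].index)

print(len(df), len(without_david), len(without_two))
print(no_low_salary)
```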

Conclusion

In conclusion, Pandas is a highly flexible, powerful, and popular data analysis library. The techniques covered here are only the tip of the iceberg in terms of its capabilities.

By learning these Pandas DataFrame selection, filtering, and manipulation techniques, users can harness Pandas’ full potential in their data analysis projects.

DataFrames Aggregation and Grouping

In data analysis, we often need to summarize data to gain insights. Pandas offers many inbuilt aggregation functions and grouping methods to accomplish this task.

In this article, we’ll discuss two critical concepts in DataFrame aggregation and grouping techniques: aggregating data and grouping data.

Aggregating Data

Aggregating data is essentially the process of summarizing data. We can use functions like `sum()`, `mean()`, `max()`, `min()`, among others, to perform these operations.

Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# Aggregating data
print("Total Salary:", df['Salary'].sum())
print("Average Age:", df['Age'].mean())
print("Maximum Salary:", df['Salary'].max())
print("Minimum Age:", df['Age'].min())
```

Output:

```
Total Salary: 310000
Average Age: 28.25
Maximum Salary: 100000
Minimum Age: 23
```

Here, we have created a DataFrame and used Pandas’ inbuilt aggregation functions to calculate the total salary, average age, maximum salary, and minimum age.
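Several statistics can also be computed for several columns in one call. The sketch below uses the DataFrame-level `agg()` method with a list of function names; which columns and statistics to include is an arbitrary choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah'],
    'Age': [23, 28, 35, 27],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000]
})

# One row per statistic, one column per numeric field.
summary = df[['Age', 'Salary']].agg(['min', 'max', 'mean'])

print(summary)
```

The result is itself a DataFrame, so a single value can be read with `summary.loc['mean', 'Age']`.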

Grouping Data

Grouping data is a powerful way to segment data and then apply various functions to those groups. Here’s an example:

```
import pandas as pd

# create a DataFrame
df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'Tom', 'Jessica'],
    'Age': [23, 28, 35, 27, 31, 29],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000, 92000, 80000]
})

# grouping data
grouped = df.groupby('Gender')
print(grouped['Salary'].mean())
```

Output:

```
Gender
Female    80000.000000
Male      80666.666667
Name: Salary, dtype: float64
```

Here, we have created a DataFrame and then grouped it based on the ‘Gender’ column using the `groupby()` method. After grouping based on ‘Gender’, we have used the `mean()` function to calculate the average salary of both males and females.

We can also use the `agg` method to compute multiple values for every group simultaneously:

```
grouped = df.groupby('Gender')
print(grouped['Salary'].agg(['count', 'mean', 'min', 'max']))
```

Output:

```
        count          mean    min     max
Gender
Female      3  80000.000000  75000   85000
Male        3  80666.666667  50000  100000
```

Here, we have used the `agg()` method to calculate the count, mean, min, and max salary for males and females.
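When different columns need different statistics, `agg()` also accepts keyword arguments of the form `new_name=(column, function)`, often called named aggregation. The output column names below (`avg_salary`, `max_age`, `headcount`) are invented for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emma', 'Bob', 'Sarah', 'Tom', 'Jessica'],
    'Age': [23, 28, 35, 27, 31, 29],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Salary': [50000, 75000, 100000, 85000, 92000, 80000]
})

# Each keyword names an output column and maps it to (source column, function).
summary = df.groupby('Gender').agg(
    avg_salary=('Salary', 'mean'),
    max_age=('Age', 'max'),
    headcount=('Name', 'count'),
)

print(summary)
```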

Conclusion

In conclusion, Pandas offers several in-built methods for data aggregation and grouping. Users can achieve meaningful insights into datasets by utilizing these powerful features in Pandas.

DataFrame aggregation and grouping are essential steps toward using Pandas for impactful data analysis. Pandas is an essential library for data manipulation and analysis in Python, and the DataFrame is one of its central features.

This article has discussed important concepts in DataFrame manipulation, including adding rows, creating and viewing DataFrames, selecting columns, filtering rows, updating and deleting rows, and aggregating and grouping data. These are recommended techniques for anyone seeking to master data analysis with Pandas.

By the end of this article, it should be apparent that the ease of use and power of these tools make them imperative for any data analysis project. Embracing these techniques in Pandas will help users to extract insights from raw data.
