Mastering Data Manipulation with Pandas: Filtering and Grouping Techniques

Pandas is a widely used data manipulation library in Python that helps to organize, analyze, and manipulate complex datasets. The library provides a vast number of functions and features that can help you to work with datasets efficiently and effectively.

Two tasks that every data analyst or scientist runs into early are finding the median of a dataset and creating a Pandas DataFrame. In this article, we will explore both in detail.

Finding Median of Pandas DataFrame

The median is the middle value of a sorted set of numerical values: half of the values lie below it and half lie above it. In Pandas, we can find the median of a column using the median() method.

Let us see how to find the median of a single column and multiple columns.

Finding the Median of a Single Column

To find the median of a single column in a Pandas DataFrame, we can use the median() function. Here is an example of how to do that:

import pandas as pd
df = pd.read_csv('sample_data.csv')
median_score = df['score'].median()
print('Median score:', median_score)

In the above example, we read a CSV file into a Pandas DataFrame and then accessed the ‘score’ column to find its median using the median() function. The output will be the median score of that column, displaying the result as a floating-point number.
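For readers without sample_data.csv at hand, the same call can be tried on a small in-memory DataFrame. The sketch below uses made-up scores (the values are assumptions for illustration only) and shows two details of median(): missing values are skipped by default, and with an even number of values the result is the mean of the two middle ones.

```python
import pandas as pd

# Hypothetical data standing in for the 'score' column of sample_data.csv
df = pd.DataFrame({'score': [70, 85, None, 90, 100]})

# The missing value is ignored; the sorted values are 70, 85, 90, 100,
# so the median is the mean of the two middle values, 85 and 90
median_score = df['score'].median()
print('Median score:', median_score)  # Median score: 87.5
```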

Finding the Median of Multiple Columns

Similarly, you can also find the median of multiple columns in a Pandas DataFrame using the median() function. Here is an example of how to do that:

import pandas as pd
df = pd.read_csv('sample_data.csv')
median_scores = df[['score_1', 'score_2', 'score_3']].median()
print('Median scores:\n', median_scores)

In the above example, we read the CSV file into a Pandas DataFrame and accessed three columns (‘score_1’, ‘score_2’, and ‘score_3’) to find their medians using the median() function. The output will be the median scores of those columns, displaying the result as a Pandas Series object.
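As a self-contained variant, the sketch below builds a small DataFrame with invented values (only the column names come from the example above) and shows how to read one median out of the resulting Series. It also shows the axis=1 argument, which computes a median per row instead of per column.

```python
import pandas as pd

# Hypothetical scores; the column names follow the example above
df = pd.DataFrame({
    'score_1': [60, 70, 80],
    'score_2': [55, 65, 95],
    'score_3': [90, 85, 70],
})

cols = ['score_1', 'score_2', 'score_3']

# Column-wise medians: a Series indexed by column name
median_scores = df[cols].median()
print(median_scores['score_2'])  # 65.0

# Row-wise medians: one value per row, computed across the three columns
row_medians = df[cols].median(axis=1)
print(row_medians.tolist())  # [60.0, 70.0, 80.0]
```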

Finding the Median of All Numeric Columns

In some cases, we may want to find the median of all the numeric columns in a Pandas DataFrame. To do this, we can use the select_dtypes() function to select only the numeric columns and then use the median() function.

Here is an example of how to do that:

import pandas as pd
df = pd.read_csv('sample_data.csv')
# 'number' matches every integer and float dtype, not only int64 and float64
numeric_columns = df.select_dtypes(include='number').columns.tolist()
median_scores = df[numeric_columns].median()
print('Median scores:\n', median_scores)

In the above example, we first used the select_dtypes() function to select only the numeric columns in the Pandas DataFrame and stored them in the numeric_columns list. Then, we used the list to access those columns and found their medians using the median() function.

The output will be the median scores of all the numeric columns in the DataFrame.
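A shorter route to the same result is the numeric_only argument of median(). The sketch below, on invented data, compares the two approaches; it is one possible way to do this rather than the only one.

```python
import pandas as pd

# Hypothetical mixed-type data
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid'],
    'age': [20, 30, 40],
    'score': [1.5, 2.5, 9.5],
})

# Option 1: select the numeric columns first, then take the median
via_select = df.select_dtypes(include='number').median()

# Option 2: let median() skip the non-numeric columns itself
via_flag = df.median(numeric_only=True)

print(via_select.equals(via_flag))  # True
```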

Creating a Pandas DataFrame with Python

Creating a Pandas DataFrame with Python is a useful and straightforward process. It involves three main steps: defining columns, adding data, and viewing the result.

Let us see how to do this in detail.

Defining Columns in a DataFrame

To define the columns of a Pandas DataFrame, we use the DataFrame() constructor and pass a dictionary that maps each column name to a list of that column's values; Pandas infers each column's data type from the values. Here is an example of how to define the columns of a Pandas DataFrame:

import pandas as pd
data = {'Name': ['John', 'Amy', 'Peter'],
        'Age': [25, 27, 31],
        'Gender': ['Male', 'Female', 'Male']}
df = pd.DataFrame(data)

print(df)

In the above example, we defined the columns of the DataFrame using a dictionary, data, with three keys (‘Name’, ‘Age’, and ‘Gender’), each mapped to the list of values for that column. We then passed the dictionary to the DataFrame() constructor to create the DataFrame and printed it.
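After building a DataFrame it is often worth checking its shape and inferred types. The sketch below repeats the construction above, inspects the result, and builds the same table from a list of row dictionaries, which is a common alternative when data arrives record by record.

```python
import pandas as pd

data = {'Name': ['John', 'Amy', 'Peter'],
        'Age': [25, 27, 31],
        'Gender': ['Male', 'Female', 'Male']}
df = pd.DataFrame(data)

print(df.shape)          # (3, 3)
print(list(df.columns))  # ['Name', 'Age', 'Gender']

# Age is inferred as an integer dtype because every value is an int
print(pd.api.types.is_integer_dtype(df['Age']))  # True

# The same table built row by row from a list of dictionaries
rows = [{'Name': 'John', 'Age': 25, 'Gender': 'Male'},
        {'Name': 'Amy', 'Age': 27, 'Gender': 'Female'},
        {'Name': 'Peter', 'Age': 31, 'Gender': 'Male'}]
df_from_rows = pd.DataFrame(rows)
print(df.equals(df_from_rows))  # True
```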

Adding Data to a DataFrame

To add data to a Pandas DataFrame, we can use pd.concat() to append rows, or the loc[], iloc[], and at[] indexers to assign values. (The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.) Here is an example of how to append a row to a Pandas DataFrame:

import pandas as pd
df = pd.read_csv('sample_data.csv')
new_row = {'name': 'Mark', 'age': 22, 'score': 90}
# Wrap the dictionary in a one-row DataFrame, then concatenate it to df
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df.tail())

In the above example, we first read a CSV file into a Pandas DataFrame and then appended a new row to it using the pd.concat() function. We wrapped a dictionary, new_row, with three keys (‘name’, ‘age’, and ‘score’) and their respective values in a one-row DataFrame and concatenated it with the original.

The ignore_index=True parameter discards the existing row labels and renumbers the result from 0, so the appended row receives the next index value. Finally, we printed the last five rows of the updated DataFrame.
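The loc[] and at[] indexers mentioned above can also add or change data in place. The sketch below uses invented rows and assumes the DataFrame has the default 0..n-1 index, so that len(df) is the next unused row label; with any other index, a different label would be needed.

```python
import pandas as pd

# Hypothetical data with the same columns as new_row above
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [20, 30], 'score': [88, 75]})

# Add a row: with a default 0..n-1 index, len(df) is the next unused label.
# The list must supply one value per column, in column order.
df.loc[len(df)] = ['Mark', 22, 90]

# Update a single existing cell by row label and column name
df.at[0, 'score'] = 95

print(len(df))            # 3
print(df.loc[2, 'name'])  # Mark
```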

Viewing a DataFrame

To view a Pandas DataFrame, we can use the head() or tail() method, which by default displays the first five or last five rows of the DataFrame, respectively. We can also use the iloc[] and loc[] indexers to access specific rows and columns by integer position or by label, respectively.

Here is an example of how to view a Pandas DataFrame:

import pandas as pd
df = pd.read_csv('sample_data.csv')
print('First five rows of the DataFrame:\n', df.head())
print('\nLast five rows of the DataFrame:\n', df.tail())
print('\nAccessing specific rows and columns of the DataFrame:\n', df.iloc[[0, 4], [1, 3]])

In the above example, we first read a CSV file into a Pandas DataFrame and then used head(), tail(), and iloc[] to view parts of it. The output displays the first five rows, the last five rows, and finally the first and fifth rows restricted to the second and fourth columns, selected by position with iloc[].
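The difference between position-based and label-based access is easiest to see when the index is not the default one. The sketch below uses an invented DataFrame with letter labels; it also passes a row count to head() to override the default of five.

```python
import pandas as pd

# Hypothetical data with a non-default index so labels and positions differ
df = pd.DataFrame(
    {'name': ['Ann', 'Bob', 'Cid', 'Dee'], 'score': [88, 75, 92, 60]},
    index=['a', 'b', 'c', 'd'],
)

first_two = df.head(2)                    # first two rows instead of five
by_position = df.iloc[[0, 3], [1]]        # rows at positions 0 and 3, column at position 1
by_label = df.loc[['a', 'd'], ['score']]  # the same cells, selected by label

print(len(first_two))                # 2
print(by_position.equals(by_label))  # True
```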

Conclusion

In this article, we explored two essential skills in Pandas: finding the median of a Pandas DataFrame and creating a Pandas DataFrame with Python. We learned how to find the median of a single column, multiple columns, and all numeric columns in a Pandas DataFrame.

We also learned how to define columns, add data, and view a Pandas DataFrame. By mastering these skills, you can better analyze and manipulate complex datasets using Pandas in Python.

In the rest of this article, we continue our exploration of Pandas by looking at two more essential skills: filtering Pandas DataFrame rows, and grouping and aggregating data. With these skills, you can efficiently and effectively analyze large datasets using Pandas in Python.

Filtering Pandas DataFrame Rows

Filtering rows in a Pandas DataFrame is an essential task that allows us to extract only the data that meets specific conditions. There are several ways to filter rows based on conditions, range of values, or string values in a Pandas DataFrame.

Filtering Rows Based on a Condition

To filter rows in a Pandas DataFrame based on a specific condition, we use the loc[] indexer with a boolean mask. A comparison such as df['score'] > 85 produces a Series of True/False values, one per row, and loc[] returns only the rows where the value is True.

Here is an example of how to filter rows based on a condition:

import pandas as pd
df = pd.read_csv('sample_data.csv')
filtered_df = df.loc[df['score'] > 85]
print(filtered_df.head())

In the above example, we read a CSV file into a Pandas DataFrame and defined a condition to filter only the rows where the value in the ‘score’ column is greater than 85. We used the loc[] function with the condition to filter the rows and stored the result in a new DataFrame filtered_df.

Finally, we printed the first five rows of the filtered DataFrame using the head() function.
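Conditions can also be combined. The sketch below, on invented data, prints the boolean mask behind a single condition and then combines two conditions with & (and) and | (or). The column names follow the examples in this article; the values are assumptions.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'age': [22, 35, 28, 41],
    'score': [88, 91, 79, 95],
})

mask = df['score'] > 85
print(mask.tolist())  # [True, True, False, True]

# Each comparison needs its own parentheses because & and | bind
# more tightly than > and <
high_and_young = df.loc[(df['score'] > 85) & (df['age'] < 30)]
high_or_young = df.loc[(df['score'] > 85) | (df['age'] < 30)]

print(high_and_young['name'].tolist())  # ['Ann']
print(len(high_or_young))               # 4
```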

Filtering Rows Based on a Range of Values

To filter rows in a Pandas DataFrame based on a range of values, we can use the between() method. It takes a lower and an upper boundary and returns a boolean Series that is True for the values within that range (both boundaries are included by default); passing that Series to loc[] keeps only the matching rows.

Here is an example of how to filter rows based on a range of values:

import pandas as pd
df = pd.read_csv('sample_data.csv')
filtered_df = df.loc[df['age'].between(20, 30)]
print(filtered_df.head())

In the above example, we read a CSV file into a Pandas DataFrame and used the between() function to filter only the rows where the value in the ‘age’ column falls between 20 and 30 inclusive. We used the loc[] function with the condition to filter the rows and stored the result in a new DataFrame filtered_df.

Finally, we printed the first five rows of the filtered DataFrame using the head() function.
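Whether the endpoints count as part of the range is controlled by the inclusive argument of between(). The sketch below uses invented ages chosen to sit on and around the boundaries; note that the string values for inclusive require pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical ages on and around the boundaries 20 and 30
df = pd.DataFrame({'age': [19, 20, 25, 30, 31]})

# Default: both endpoints are included
both = df.loc[df['age'].between(20, 30)]
print(both['age'].tolist())  # [20, 25, 30]

# Exclude both endpoints
neither = df.loc[df['age'].between(20, 30, inclusive='neither')]
print(neither['age'].tolist())  # [25]

# Equivalent explicit comparison for the default behaviour
explicit = df.loc[(df['age'] >= 20) & (df['age'] <= 30)]
print(both.equals(explicit))  # True
```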

Filtering Rows Based on String Values

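To filter rows in a Pandas DataFrame based on string values, we can use the str.contains() method. It takes a pattern (case-sensitive and interpreted as a regular expression by default) and returns a boolean Series that is True where the value in the column contains that pattern; passing that Series to loc[] keeps only the matching rows.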

Here is an example of how to filter rows based on string values:

import pandas as pd
df = pd.read_csv('sample_data.csv')
filtered_df = df.loc[df['name'].str.contains('Jo')]
print(filtered_df.head())

In the above example, we read a CSV file into a Pandas DataFrame and used the str.contains() function to filter only the rows where the value in the ‘name’ column contains the string ‘Jo’. We used the loc[] function with the condition to filter the rows and stored the result in a new DataFrame filtered_df.

Finally, we printed the first five rows of the filtered DataFrame using the head() function.
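Real name columns often contain missing values and mixed capitalization, which str.contains() handles through its na, case, and regex arguments. The sketch below uses invented names to show each one; it is an illustration under those assumptions, not a complete treatment of string matching.

```python
import pandas as pd

# Hypothetical names, including one missing value
df = pd.DataFrame({'name': ['John', 'joanna', 'Major', None, 'Amy']})

# A missing name yields a missing value in the mask, so na=False is
# passed to treat missing names as non-matches
case_sensitive = df.loc[df['name'].str.contains('Jo', na=False)]
print(case_sensitive['name'].tolist())  # ['John']

# case=False also matches 'joanna' and the 'jo' inside 'Major'
case_insensitive = df.loc[df['name'].str.contains('Jo', case=False, na=False)]
print(case_insensitive['name'].tolist())  # ['John', 'joanna', 'Major']

# regex=False treats the pattern as literal text, not a regular expression
literal = df.loc[df['name'].str.contains('Jo', regex=False, na=False)]
print(literal['name'].tolist())  # ['John']
```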

Grouping and Aggregating Data in Pandas DataFrame

Grouping and aggregating data in a Pandas DataFrame is a powerful way to calculate statistics and summarize large datasets. There are three main steps involved in grouping and aggregating data – grouping the data by one or more columns, applying a function to each group, and summarizing the results.

Grouping Data by One Column

To group data in a Pandas DataFrame by a single column, we use the groupby() function. The groupby() function takes the column name by which we want to group the data and returns a grouped object that we can use to apply functions to each group.

Here is an example of how to group data by one column:

import pandas as pd
df = pd.read_csv('sample_data.csv')
grouped_data = df.groupby('gender')
print(grouped_data.size())

In the above example, we read a CSV file into a Pandas DataFrame and used the groupby() function to group the data by the ‘gender’ column. We then applied the size() function to each group to count the number of rows in each group.

The output will be a Pandas Series object that displays the number of rows in each gender group.
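To make the output concrete, the sketch below groups a small invented DataFrame by ‘gender’, counts the rows per group, and then computes a per-group mean by selecting a column from the grouped object. The values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M', 'F'],
    'score': [90, 70, 80, 60, 100],
})

grouped_data = df.groupby('gender')

# Number of rows in each group
sizes = grouped_data.size()
print(sizes['F'], sizes['M'])  # 3 2

# Select a column from the grouped object, then reduce each group to one value
mean_scores = grouped_data['score'].mean()
print(mean_scores['F'], mean_scores['M'])  # 90.0 65.0
```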

Grouping Data by Multiple Columns

To group data in a Pandas DataFrame by multiple columns, we can pass a list of column names to the groupby() function. The groupby() function will then group the data by each column in the list and return a grouped object that we can use to apply functions to each group.

Here is an example of how to group data by multiple columns:

import pandas as pd
df = pd.read_csv('sample_data.csv')
grouped_data = df.groupby(['gender', 'age'])
print(grouped_data.size())

In the above example, we read a CSV file into a Pandas DataFrame and used the groupby() function to group the data by the ‘gender’ and ‘age’ columns. We then applied the size() function to each group to count the number of rows in each gender-age group.

The output will be a Pandas Series object that displays the number of rows in each gender-age group.
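When grouping by several columns, the resulting Series is indexed by combinations of the group keys. The sketch below, on invented data, looks up one group by its (gender, age) pair and then uses reset_index() to turn the keys back into ordinary columns, which is often easier to work with afterwards.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'F'],
    'age': [25, 25, 30, 25, 30],
})

sizes = df.groupby(['gender', 'age']).size()

# Each group is addressed by a (gender, age) tuple
print(sizes.loc[('F', 25)])  # 2

# reset_index turns the group keys back into ordinary columns
flat = sizes.reset_index(name='count')
print(list(flat.columns))  # ['gender', 'age', 'count']
print(len(flat))           # 4
```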

Aggregating Data with Functions

To aggregate data in a Pandas DataFrame, we use the agg() method on a grouped object. It accepts a single function, a list of functions, or a dictionary that maps column names to functions, and returns a summary of the results for each group.

Here is an example of how to aggregate data with functions:

import pandas as pd
df = pd.read_csv('sample_data.csv')
grouped_data = df.groupby(['gender', 'age'])
agg_data = grouped_data.agg({'score': ['min', 'max', 'mean']})
print(agg_data.head())

In the above example, we read a CSV file into a Pandas DataFrame and used the groupby() function to group the data by the ‘gender’ and ‘age’ columns. We then used the agg() function with a dictionary that maps the ‘score’ column to the list of functions (‘min’, ‘max’, and ‘mean’) to apply to it in each group.

The output will be a Pandas DataFrame that summarizes the minimum, maximum, and mean scores for each gender-age group.
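Passing a dictionary of lists to agg() produces two-level column names such as (‘score’, ‘min’). An alternative is named aggregation, available since pandas 0.25, where each output column is given an explicit name. The sketch below uses invented data and groups by ‘gender’ only to keep the output small; the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M'],
    'score': [80, 100, 60, 70],
})

# Named aggregation: output_column=(input_column, function)
agg_data = df.groupby('gender').agg(
    min_score=('score', 'min'),
    max_score=('score', 'max'),
    mean_score=('score', 'mean'),
)

print(list(agg_data.columns))           # ['min_score', 'max_score', 'mean_score']
print(agg_data.loc['F', 'mean_score'])  # 90.0
print(agg_data.loc['M', 'max_score'])   # 70
```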

Conclusion

In this article, we learned two more essential skills in Pandas – filtering Pandas DataFrame rows and grouping and aggregating data. We learned how to filter rows based on conditions, range of values, and string values, and how to group and aggregate data by one or multiple columns using functions.

These skills are essential for data analysts and scientists who need to manage large datasets and derive meaningful insights from them. By mastering them, you can better analyze and manipulate large datasets using Pandas in Python.

Adventures in Machine Learning
