Adventures in Machine Learning

Mastering Data Manipulation with Pandas: Filtering and Grouping Techniques

Pandas is a widely used data manipulation library in Python that helps to organize, analyze, and manipulate complex datasets. The library provides a vast number of functions and features that can help you to work with datasets efficiently and effectively.

Two critical skills that every data analyst or scientist must have are finding the median and creating a Pandas DataFrame. In this article, we will explore these two skills in detail.

Finding Median of Pandas DataFrame

The median is the middle value of a sorted set of numbers: half of the values lie below it and half above. When the set has an even number of values, the median is the mean of the two middle values. In Pandas, the `median()` method computes it for a single column or for several columns at once.

Let us see how to find the median of a single column and multiple columns.

Finding the Median of a Single Column

To find the median of a single column in a Pandas DataFrame, we can use the `median()` function. Here is an example of how to do that:

```python
import pandas as pd

# Load the dataset from a CSV file
df = pd.read_csv('sample_data.csv')

# Median of the 'score' column
median_score = df['score'].median()

print('Median score:', median_score)
```

In the above example, we read a CSV file into a Pandas DataFrame and then accessed the ‘score’ column to find its median using the `median()` function. The output will be the median score of that column, displaying the result as a floating-point number.
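Because the contents of `sample_data.csv` are not shown, the following minimal sketch uses a small in-memory DataFrame with invented values (an assumption made purely for illustration). It shows how `median()` handles an even number of values and how it treats missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the values are invented for illustration
df = pd.DataFrame({'score': [70, 85, 90, 60, np.nan]})

# NaN is skipped by default, so the median is taken over 60, 70, 85, 90.
# With an even number of values, it is the mean of the two middle ones.
median_score = df['score'].median()
print('Median score:', median_score)  # (70 + 85) / 2 = 77.5

# skipna=False propagates the missing value instead of ignoring it
median_with_nan = df['score'].median(skipna=False)
print('Median with skipna=False:', median_with_nan)
```

If a column may contain missing values, it is worth deciding explicitly whether they should be ignored (the default) or should make the result missing as well.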

Finding the Median of Multiple Columns

Similarly, you can also find the median of multiple columns in a Pandas DataFrame using the `median()` function. Here is an example of how to do that:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Select three columns and compute one median per column
median_scores = df[['score_1', 'score_2', 'score_3']].median()

print('Median scores:\n', median_scores)
```

In the above example, we read the CSV file into a Pandas DataFrame and accessed three columns (‘score_1’, ‘score_2’, and ‘score_3’) to find their medians using the `median()` function. The output will be the median scores of those columns, displaying the result as a Pandas Series object.
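The direction of the calculation can be changed with the `axis` parameter. The sketch below, which again uses invented in-memory data rather than the CSV file, compares the default column-wise medians with row-wise medians.

```python
import pandas as pd

# Hypothetical data: three score columns for three students
df = pd.DataFrame({
    'score_1': [80, 90, 70],
    'score_2': [60, 75, 95],
    'score_3': [88, 85, 91],
})

columns = ['score_1', 'score_2', 'score_3']

# Default (axis=0): one median per column
column_medians = df[columns].median()

# axis=1: one median per row, computed across the selected columns
row_medians = df[columns].median(axis=1)

print('Per column:\n', column_medians)
print('Per row:\n', row_medians)
```

The row-wise form is handy when each row represents one entity (for example a student) and the columns hold repeated measurements.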

Finding the Median of All Numeric Columns

In some cases, we may want to find the median of all the numeric columns in a Pandas DataFrame. To do this, we can use the `select_dtypes()` function to select only the numeric columns and then use the `median()` function.

Here is an example of how to do that:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Names of all columns stored as 64-bit integers or floats
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

median_scores = df[numeric_columns].median()

print('Median scores:\n', median_scores)
```

In the above example, we first used the `select_dtypes()` function to select only the numeric columns in the Pandas DataFrame and stored them in the `numeric_columns` list. Then, we used the list to access those columns and found their medians using the `median()` function.

The output will be the median scores of all the numeric columns in the DataFrame.
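Listing `int64` and `float64` misses other numeric types such as `int32` or `float32`. As a hedged alternative, the sketch below uses the generic `'number'` selector and the `numeric_only` parameter of `median()`; the DataFrame is invented for illustration.

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
df = pd.DataFrame({
    'name': ['John', 'Amy', 'Peter'],
    'age': [25, 27, 31],
    'score': [88.5, 92.0, 79.5],
})

# 'number' matches every numeric dtype, not only int64 and float64
medians_a = df.select_dtypes(include='number').median()

# Shortcut: let median() skip the non-numeric columns itself
medians_b = df.median(numeric_only=True)

print(medians_a)
print(medians_b)
```

Both approaches return a Series indexed by the numeric column names.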

Creating a Pandas DataFrame with Python

Creating a Pandas DataFrame with Python is a useful and straightforward process. It involves three main steps: defining columns, adding data, and viewing the result.

Let us see how to do this in detail.

Defining Columns in a DataFrame

To define the columns of a Pandas DataFrame, we use the `DataFrame()` constructor and pass a dictionary that maps each column name to a list of values. Pandas infers the data type of each column from those values. Here is an example of how to define the columns of a Pandas DataFrame:

```python
import pandas as pd

# Each key is a column name; each list holds that column's values
data = {'Name': ['John', 'Amy', 'Peter'],
        'Age': [25, 27, 31],
        'Gender': ['Male', 'Female', 'Male']}

df = pd.DataFrame(data)

print(df)
```

In the above example, we defined the columns of the DataFrame using a dictionary `data` with three keys, ‘Name’, ‘Age’, and ‘Gender’, each mapped to a list holding the data for that column. We then passed the dictionary to the `DataFrame()` constructor to create the DataFrame and printed it.
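To see which data types Pandas inferred, and to change them when needed, one option is the `dtypes` attribute together with `astype()`. The target types chosen below are only an illustration.

```python
import pandas as pd

data = {'Name': ['John', 'Amy', 'Peter'],
        'Age': [25, 27, 31],
        'Gender': ['Male', 'Female', 'Male']}

df = pd.DataFrame(data)

# Data types inferred from the values
print(df.dtypes)

# Convert columns explicitly; these target types are illustrative choices
df = df.astype({'Age': 'float64', 'Gender': 'category'})
print(df.dtypes)
```

Converting a low-cardinality text column to `category` can reduce memory use, although whether it is worthwhile depends on the dataset.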

Adding Data to a DataFrame

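To add rows to a Pandas DataFrame, we can concatenate with `pd.concat()` or assign to a new label through the `loc[]` indexer; `iloc[]` and `at[]` are mainly useful for overwriting values that already exist. Older tutorials use `DataFrame.append()`, but that method was deprecated in pandas 1.4 and removed in pandas 2.0. Here is an example of how to append a row to a Pandas DataFrame:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

new_row = {'name': 'Mark', 'age': 22, 'score': 90}

# DataFrame.append() no longer exists in pandas 2.0+,
# so wrap the dictionary in a one-row DataFrame and concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

print(df.tail())
```

In the above example, we first read a CSV file into a Pandas DataFrame and then appended a new row by wrapping the dictionary `new_row` in a one-row DataFrame and concatenating it with `pd.concat()`. The dictionary consists of three keys, ‘name’, ‘age’, and ‘score’, and their respective values.

The `ignore_index=True` parameter discards the existing index labels and renumbers the rows from 0, so the appended row receives the next integer index. Finally, we printed the last five rows of the updated DataFrame.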
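A single row can also be added by assigning to a label that does not exist yet through `loc[]`. The following sketch uses invented in-memory data and assumes the DataFrame has the default integer index, so that `len(df)` is the next unused label.

```python
import pandas as pd

# Hypothetical starting data
df = pd.DataFrame({'name': ['John', 'Amy'],
                   'age': [25, 27],
                   'score': [88, 92]})

# Assigning to a new label adds a row; values follow the column order
df.loc[len(df)] = ['Mark', 22, 90]

# For several rows at once, build a DataFrame and concatenate it
more_rows = pd.DataFrame([{'name': 'Sara', 'age': 30, 'score': 75},
                          {'name': 'Liam', 'age': 24, 'score': 81}])
df = pd.concat([df, more_rows], ignore_index=True)

print(df)
```

Adding rows one at a time in a loop copies data repeatedly, so for many rows it is usually faster to collect them in a list and concatenate once.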

Viewing a DataFrame

To view a Pandas DataFrame, we can use the `head()` or `tail()` method, which by default displays the first five or last five rows of the DataFrame, respectively. We can also use the `iloc[]` and `loc[]` indexers to access specific rows and columns by integer position or by label, respectively.

Here is an example of how to view a Pandas DataFrame:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

print('First five rows of the DataFrame:\n', df.head())

print('\nLast five rows of the DataFrame:\n', df.tail())

# Rows at positions 0 and 4, columns at positions 1 and 3
print('\nAccessing specific rows and columns of the DataFrame:\n', df.iloc[[0, 4], [1, 3]])
```

In the above example, we first read a CSV file into a Pandas DataFrame and then used the `head()`, `tail()`, and `iloc[]` functions to view specific rows and columns of the DataFrame. The output displays the first five and last five rows of the DataFrame and accesses the first and fifth rows and the second and fourth columns of the DataFrame using the `iloc[]` function.
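The difference between position-based and label-based access is easiest to see on a DataFrame whose index is not the default 0, 1, 2, and so on. The sketch below uses invented data with letter labels as the index.

```python
import pandas as pd

# Hypothetical data with a custom (non-integer) index
df = pd.DataFrame({'name': ['John', 'Amy', 'Peter', 'Mark'],
                   'age': [25, 27, 31, 22],
                   'score': [88, 92, 79, 90]},
                  index=['a', 'b', 'c', 'd'])

first_two = df.head(2)  # first two rows instead of the default five
last_one = df.tail(1)   # last row only

# iloc selects by integer position: rows 0 and 2, columns 0 and 2
by_position = df.iloc[[0, 2], [0, 2]]

# loc selects by label: the same cells, addressed by name
by_label = df.loc[['a', 'c'], ['name', 'score']]

print(by_position)
print(by_label)
```

Both selections return the same two rows and two columns; only the way they are addressed differs.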

Conclusion

So far, we have explored two essential skills in Pandas: finding the median of a Pandas DataFrame and creating a Pandas DataFrame with Python. We learned how to find the median of a single column, multiple columns, and all numeric columns in a Pandas DataFrame.

We also learned how to define columns, add data, and view a Pandas DataFrame. By mastering these skills, you can better analyze and manipulate complex datasets using Pandas in Python.

The rest of this article continues our exploration of Pandas with two more essential skills: filtering Pandas DataFrame rows, and grouping and aggregating data. With these skills, you can analyze large datasets efficiently using Pandas in Python.

Filtering Pandas DataFrame Rows

Filtering rows in a Pandas DataFrame is an essential task that allows us to extract only the data that meets specific conditions. There are several ways to filter rows based on conditions, range of values, or string values in a Pandas DataFrame.

Filtering Rows Based on a Condition

To filter rows in a Pandas DataFrame based on a specific condition, we pass a boolean expression to the `loc[]` indexer. The expression is evaluated for every row and produces a boolean Series, and `loc[]` returns only the rows where that Series is `True`.

Here is an example of how to filter rows based on a condition:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Keep only the rows whose score is greater than 85
filtered_df = df.loc[df['score'] > 85]

print(filtered_df.head())
```

In the above example, we read a CSV file into a Pandas DataFrame and defined a condition to filter only the rows where the value in the ‘score’ column is greater than 85. We used the `loc[]` function with the condition to filter the rows and stored the result in a new DataFrame `filtered_df`.

Finally, we printed the first five rows of the filtered DataFrame using the `head()` function.
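Several conditions can be combined with the element-wise operators `&` (and), `|` (or), and `~` (not). The sketch below uses invented in-memory data; note that each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'name': ['John', 'Amy', 'Peter', 'Mark'],
                   'age': [25, 27, 31, 22],
                   'score': [88, 92, 79, 90]})

# The condition itself is a boolean Series, one value per row
mask = df['score'] > 85
high_scores = df.loc[mask]

# Both conditions must hold
young_high = df.loc[(df['score'] > 85) & (df['age'] < 26)]

# Negate a condition with ~
not_high = df.loc[~mask]

print(young_high)
```

Python's `and`, `or`, and `not` keywords do not work here, because they expect a single truth value rather than a Series.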

Filtering Rows Based on a Range of Values

To filter rows in a Pandas DataFrame based on a range of values, we can use the `between()` method. It takes a lower and an upper boundary and returns a boolean Series marking the values that fall within that range, with both boundaries included by default. Passing that Series to `loc[]` keeps only the matching rows.

Here is an example of how to filter rows based on a range of values:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Keep rows where 20 <= age <= 30
filtered_df = df.loc[df['age'].between(20, 30)]

print(filtered_df.head())
```

In the above example, we read a CSV file into a Pandas DataFrame and used the `between()` function to filter only the rows where the value in the ‘age’ column falls between 20 and 30 inclusive. We used the `loc[]` function with the condition to filter the rows and stored the result in a new DataFrame `filtered_df`.

Finally, we printed the first five rows of the filtered DataFrame using the `head()` function.
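Whether the boundaries themselves count can be controlled with the `inclusive` parameter. The sketch below uses invented data and assumes pandas 1.3 or later, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`.

```python
import pandas as pd

# Hypothetical data with ages sitting exactly on the boundaries
df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cat', 'Dan'],
                   'age': [20, 25, 30, 35]})

# Default: both boundaries are included
inclusive_both = df.loc[df['age'].between(20, 30)]

# Exclude both boundaries
exclusive = df.loc[df['age'].between(20, 30, inclusive='neither')]

# Include only the lower boundary
left_only = df.loc[df['age'].between(20, 30, inclusive='left')]

# The default is equivalent to two comparisons joined with &
equivalent = df.loc[(df['age'] >= 20) & (df['age'] <= 30)]

print(inclusive_both)
```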

Filtering Rows Based on String Values

To filter rows in a Pandas DataFrame based on string values, we can use the `str.contains()` method. It takes a pattern (interpreted as a regular expression by default) and returns a boolean Series marking the rows whose value contains it. Passing that Series to `loc[]` keeps only the matching rows.

Here is an example of how to filter rows based on string values:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Keep rows whose name contains the substring 'Jo' (case-sensitive)
filtered_df = df.loc[df['name'].str.contains('Jo')]

print(filtered_df.head())
```

In the above example, we read a CSV file into a Pandas DataFrame and used the `str.contains()` function to filter only the rows where the value in the ‘name’ column contains the string ‘Jo’. We used the `loc[]` function with the condition to filter the rows and stored the result in a new DataFrame `filtered_df`.

Finally, we printed the first five rows of the filtered DataFrame using the `head()` function.
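In practice, text columns often contain missing values or mixed capitalization. The sketch below, built on invented names, shows three optional parameters of `str.contains()` that address this: `na`, `case`, and `regex`.

```python
import numpy as np
import pandas as pd

# Hypothetical names, including a missing value
df = pd.DataFrame({'name': ['John', 'joanna', 'Amy', np.nan, 'A.J.']})

# na=False treats missing names as non-matches, so the mask stays boolean
contains_jo = df.loc[df['name'].str.contains('Jo', na=False)]

# case=False ignores capitalization
any_case = df.loc[df['name'].str.contains('Jo', case=False, na=False)]

# regex=False matches the text literally; as a regex, '.' would match any character
literal_dot = df.loc[df['name'].str.contains('.', regex=False, na=False)]

# str.startswith() anchors the match at the beginning of the value
starts_with_a = df.loc[df['name'].str.startswith('A', na=False)]

print(any_case)
```

Without `na=False`, a missing value produces a missing entry in the mask, which `loc[]` cannot use for filtering.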

Grouping and Aggregating Data in Pandas DataFrame

Grouping and aggregating data in a Pandas DataFrame is a powerful way to calculate statistics and summarize large datasets. There are three main steps involved in grouping and aggregating data – grouping the data by one or more columns, applying a function to each group, and summarizing the results.

Grouping Data by One Column

To group data in a Pandas DataFrame by a single column, we use the `groupby()` function. The `groupby()` function takes the column name by which we want to group the data and returns a grouped object that we can use to apply functions to each group.

Here is an example of how to group data by one column:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# Split the rows into one group per distinct gender value
grouped_data = df.groupby('gender')

# Number of rows in each group
print(grouped_data.size())
```

In the above example, we read a CSV file into a Pandas DataFrame and used the `groupby()` function to group the data by the ‘gender’ column. We then applied the `size()` function to each group to count the number of rows in each group.

The output will be a Pandas Series object that displays the number of rows in each gender group.
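Besides counting rows, the grouped object can compute a statistic for a chosen column, and it can be iterated over to inspect each group. The following sketch uses invented in-memory data.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
                   'score': [80, 90, 70, 100, 60]})

grouped = df.groupby('gender')

# Rows per group
group_sizes = grouped.size()

# Mean score per group: select the column, then aggregate
mean_scores = grouped['score'].mean()

# Iterating yields (group key, sub-DataFrame) pairs
for gender, group in grouped:
    print(gender, group['score'].tolist())

print(mean_scores)
```

Group keys are sorted by default, so ‘Female’ appears before ‘Male’ in the results.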

Grouping Data by Multiple Columns

To group data in a Pandas DataFrame by multiple columns, we can pass a list of column names to the `groupby()` function. The `groupby()` function will then group the data by each column in the list and return a grouped object that we can use to apply functions to each group.

Here is an example of how to group data by multiple columns:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

# One group per distinct (gender, age) combination
grouped_data = df.groupby(['gender', 'age'])

print(grouped_data.size())
```

In the above example, we read a CSV file into a Pandas DataFrame and used the `groupby()` function to group the data by the ‘gender’ and ‘age’ columns. We then applied the `size()` function to each group to count the number of rows in each gender-age group.

The output will be a Pandas Series object that displays the number of rows in each gender-age group.
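When grouping by several columns, the resulting Series carries a two-level index. As a sketch with invented data, the example below shows two common ways to reshape it: `reset_index()` for a flat table and `unstack()` for a cross-tabulation.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
                   'age': [25, 25, 30, 25, 25]})

# Series indexed by (gender, age) pairs
sizes = df.groupby(['gender', 'age']).size()

# Flat DataFrame with columns gender, age, count
counts = sizes.reset_index(name='count')

# Genders as rows, ages as columns; absent combinations become 0
table = sizes.unstack(fill_value=0)

print(counts)
print(table)
```

Only combinations that actually occur in the data appear in `sizes`, which is why `unstack()` needs `fill_value` to fill the gaps.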

Aggregating Data with Functions

To aggregate data in a Pandas DataFrame, we use the `agg()` method on the grouped object. It accepts a single function, a list of functions, or a dictionary mapping column names to functions, and returns a summary of the results for each group.

Here is an example of how to aggregate data with functions:

```python
import pandas as pd

df = pd.read_csv('sample_data.csv')

grouped_data = df.groupby(['gender', 'age'])

# Minimum, maximum, and mean of 'score' within each group
agg_data = grouped_data.agg({'score': ['min', 'max', 'mean']})

print(agg_data.head())
```

In the above example, we read a CSV file into a Pandas DataFrame and used the `groupby()` function to group the data by the ‘gender’ and ‘age’ columns. We then used the `agg()` method with a dictionary that specifies which functions to apply to the ‘score’ column in each group.

The output will be a Pandas DataFrame that summarizes the minimum, maximum, and mean scores for each gender-age group.
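Passing a dictionary of lists to `agg()` produces two-level column labels such as `('score', 'min')`. One alternative, sketched below on invented data, is named aggregation (available since pandas 0.25), which yields flat column names of our choosing; the names `min_score`, `max_score`, and `mean_score` are illustrative.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({'gender': ['Male', 'Female', 'Male', 'Female'],
                   'score': [80, 90, 70, 100]})

# Dictionary form: columns are (column, function) pairs
multi = df.groupby('gender').agg({'score': ['min', 'max', 'mean']})

# Named aggregation: new_column=(source_column, function)
flat = df.groupby('gender').agg(
    min_score=('score', 'min'),
    max_score=('score', 'max'),
    mean_score=('score', 'mean'),
)

print(multi)
print(flat)
```

Flat column names are usually easier to work with when the summary is exported or merged with other tables.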

Conclusion

In this article, we learned two more essential skills in Pandas: filtering DataFrame rows, and grouping and aggregating data. We filtered rows by a condition, by a range of values, and by string content, and we grouped data by one or more columns and aggregated each group to calculate summary statistics.

These skills are essential for data analysts and scientists who need to manage large datasets and derive meaningful insights from them.
