Adventures in Machine Learning

Mastering Common Tasks in Pandas: Grouping Merging and Reshaping Data

Pandas is a popular data manipulation library in Python that helps users perform a variety of tasks with ease. Among the many powerful features that Pandas offers are the query() function and the ability to create DataFrames.

In this article, we will explore these features in detail and provide several examples of how to use them.

Using the Query() Function in Pandas

The query() function is a powerful tool that allows users to extract data from a DataFrame based on certain conditional statements. The syntax for the query() function is straightforward, with two main components: the dataframe object and the conditional statement(s).

Here is an example:

“`

import pandas as pd

df = pd.read_csv(‘data.csv’)

result = df.query(‘age > 30’)

“`

In this example, we have a DataFrame object `df` that contains information about people’s age and occupation. We want to extract all rows where the age is greater than 30.

We use the `query()` function to create a new DataFrame object called `result`. The conditional statement `’age > 30’` is passed as a string argument to the `query()` function.

Example 1: Extract Column Values Based on One Condition Being Met

Now that we know the syntax for the query() function, let’s look at a few examples of how to use it. In this first example, we will extract column values based on one condition being met.

Let’s say we have a DataFrame that contains the following information about students:

“`

import pandas as pd

data = {

‘name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dan’, ‘Eli’],

‘grade’: [80, 90, 85, 70, 95]

}

df = pd.DataFrame(data)

result = df.query(‘grade > 85’)

print(result[‘name’])

“`

In this example, we have a DataFrame `df` that contains the names and grades of five students. We want to extract the names of all students who scored higher than 85.

We use the `query()` function to create a new DataFrame object called `result`. The conditional statement `’grade > 85’` is passed as a string argument to the `query()` function.

Finally, we print out the names of the students whose grade was above 85. Example 2: Extract Column Values Based on One of Several Conditions Being Met

In this next example, we will extract column values based on one of several conditions being met.

Let’s say we have a DataFrame that contains information about customers’ age, income, and gender:

“`

import pandas as pd

data = {

‘age’: [27, 35, 50, 22, 67],

‘income’: [50000, 75000, 100000, 40000, 200000],

‘gender’: [‘Female’, ‘Male’, ‘Male’, ‘Female’, ‘Male’]

}

df = pd.DataFrame(data)

result = df.query(‘gender == “Female” or income > 75000’)

print(result[[‘age’, ‘gender’]])

“`

In this example, we want to extract the age and gender of customers who are either female or have an income of over $75,000. We use the `query()` function to create a new DataFrame object called `result`, and we pass the conditional statement `’gender == “Female” or income > 75000’` as a string argument to the function.

Finally, we print out the age and gender of the customers who meet these criteria. Example 3: Extract Column Values Based on Several Conditions Being Met

In this final example, we will extract column values based on several conditions being met.

Let’s say we have a DataFrame that contains information about the weather in different cities:

“`

import pandas as pd

data = {

‘city’: [‘New York’, ‘Los Angeles’, ‘Chicago’, ‘Houston’, ‘Phoenix’],

‘temperature’: [70, 80, 65, 85, 90],

‘humidity’: [50, 60, 70, 80, 90]

}

df = pd.DataFrame(data)

result = df.query(‘temperature > 75 and humidity < 75')

print(result[[‘city’, ‘temperature’, ‘humidity’]])

“`

In this example, we want to extract the city, temperature, and humidity of all cities where the temperature is above 75 degrees Fahrenheit and the humidity is below 75%. We use the `query()` function to create a new DataFrame object called `result`, and we pass the conditional statement `’temperature > 75 and humidity < 75'` as a string argument to the function.

Finally, we print out the city, temperature, and humidity columns of the resulting DataFrame.

Creating a DataFrame Using Pandas

Creating a DataFrame using Pandas is a straightforward process that involves passing Python dictionaries or lists to the `pd.DataFrame()` function. Here is an example:

“`

import pandas as pd

data = {

‘name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dan’, ‘Eli’],

‘age’: [25, 30, 35, 40, 45],

‘occupation’: [‘Engineer’, ‘Data Scientist’, ‘Sales Manager’, ‘Business Analyst’, ‘Software Developer’]

}

df = pd.DataFrame(data)

print(df)

“`

In this example, we define a Python dictionary called `data` that contains information about Alice, Bob, Charlie, Dan, and Eli. We then pass this dictionary to the `pd.DataFrame()` function to create a new DataFrame object called `df`.

Finally, we print out the entire DataFrame using the `print()` function.

Viewing a DataFrame

Once we have created a DataFrame, we may want to view its contents to make sure that it was created correctly. To do this, we can use the `head()` function, which displays the first five rows of the DataFrame:

“`

import pandas as pd

data = {

‘name’: [‘Alice’, ‘Bob’, ‘Charlie’, ‘Dan’, ‘Eli’],

‘age’: [25, 30, 35, 40, 45],

‘occupation’: [‘Engineer’, ‘Data Scientist’, ‘Sales Manager’, ‘Business Analyst’, ‘Software Developer’]

}

df = pd.DataFrame(data)

print(df.head())

“`

In this example, we create the same DataFrame as in the previous example. However, this time we use the `head()` function to display the first five rows of the DataFrame instead of printing the entire DataFrame.

This is useful when working with large datasets that may have many rows.

Conclusion

In this article, we explored the query() function in Pandas and how it can be used to extract data from a DataFrame based on certain conditional statements. We provided several examples of how to use the query() function to extract data based on one condition, one of several conditions, or multiple conditions.

We also looked at how to create a DataFrame using Pandas and how to view its contents using the head() function. By using these powerful tools, users can manipulate and analyze data with ease in Python.

In addition to the topics we have covered thus far, there are several other common tasks that users may want to perform when working with data using Pandas. In this article, we will explore some of these tasks and provide examples of how to perform them.

Common Tasks in Pandas

1. Grouping and Aggregating Data

Grouping and aggregating data is a common task that involves grouping rows based on one or more columns and applying an aggregate function (such as sum, mean, or count) to the groups.

Here is an example:

“`

import pandas as pd

data = {

‘city’: [‘New York’, ‘Chicago’, ‘New York’, ‘Chicago’, ‘New York’],

‘year’: [2016, 2016, 2017, 2017, 2018],

‘population’: [8406000, 2716000, 8538000, 2719000, 8379000]

}

df = pd.DataFrame(data)

result = df.groupby([‘city’, ‘year’]).sum()

print(result)

“`

In this example, we have a DataFrame `df` that contains information about the population of three cities (New York, Chicago, and Los Angeles) over a three-year period. We want to group the data by city and year and compute the sum of the population within each group.

We use the `groupby()` function to group the DataFrame by the `city` and `year` columns, and the `sum()` function to compute the sum of the population in each group. The resulting DataFrame `result` contains the grouped and aggregated data.

2. Merging DataFrames

Merging two or more DataFrames is a common task that involves combining rows from multiple DataFrames based on one or more columns.

Here is an example:

“`

import pandas as pd

data1 = {

‘key’: [‘A’, ‘B’, ‘C’, ‘D’],

‘value’: [1, 2, 3, 4]

}

data2 = {

‘key’: [‘B’, ‘D’, ‘E’, ‘F’],

‘value’: [5, 6, 7, 8]

}

df1 = pd.DataFrame(data1)

df2 = pd.DataFrame(data2)

result = pd.merge(df1, df2, on=’key’)

print(result)

“`

In this example, we have two DataFrames (`df1` and `df2`) that contain information about keys and their corresponding values. We want to merge the two DataFrames based on the `key` column.

We use the `merge()` function to merge the two DataFrames on the `key` column, and the resulting DataFrame `result` contains the merged data. 3.

Reshaping Data

Reshaping data is a common task that involves changing the structure of a DataFrame by pivoting, stacking, or unstacking the data. Here is an example:

“`

import pandas as pd

data = {

‘city’: [‘New York’, ‘Chicago’, ‘Los Angeles’],

‘2018’: [8406000, 2719000, 3999000],

‘2019’: [8399000, 2713000, 3976000],

‘2020’: [8337000, 2696000, 3954000],

}

df = pd.DataFrame(data)

result1 = df.melt(id_vars=[‘city’], var_name=’year’, value_name=’population’)

result2 = result1.pivot(index=’city’, columns=’year’, values=’population’)

print(result1)

print(result2)

“`

In this example, we have a DataFrame `df` that contains information about the population of three cities (New York, Chicago, and Los Angeles) over a three-year period. We want to reshape the data so that the years are the columns and the population is the values.

We use the `melt()` function to pivot the data, and then the `pivot()` function to reshape the data. The resulting DataFrames `result1` and `result2` contain the reshaped data.

4. Handling Missing Data

Handling missing data (also known as “NaN” values) is a common task that involves identifying and handling missing values in a DataFrame.

Here is an example:

“`

import pandas as pd

import numpy as np

data = {

‘A’: [1, 2, np.nan, 4],

‘B’: [5, np.nan, np.nan, 8],

‘C’: [9, 10, 11, 12]

}

df = pd.DataFrame(data)

# Drop rows with any NaN values

result1 = df.dropna()

# Fill NaN values with the mean value of the column

result2 = df.fillna(df.mean())

print(result1)

print(result2)

“`

In this example, we have a DataFrame `df` that contains some missing values (represented as NaN). We want to either drop rows with missing values or fill the missing values with a sensible approximation (such as the mean value of the column).

We use the `dropna()` function to drop rows with any NaN values, and the `fillna()` function to fill NaN values with the mean value of the column. The resulting DataFrames `result1` and `result2` contain the cleaned data.

Conclusion

In this article, we explored several additional common tasks in Pandas that users may encounter when working with data. These tasks included grouping and aggregating data, merging DataFrames, reshaping data, and handling missing data.

By mastering these common tasks, users can develop a deeper understanding of Pandas and how to use it to manipulate and analyze data with ease. In this article, we explored the query() function and DataFrame creation in Pandas, along with some additional common tasks such as grouping and aggregating data, merging DataFrames, reshaping data, and handling missing data.

We provided examples of each task and emphasized the importance of mastering these skills to manipulate and analyze data with ease. Overall, Pandas is a powerful tool for working with data, and by practicing these common tasks, users can develop stronger proficiency in using Pandas to enhance their data analysis and decision-making abilities.

Popular Posts