Adventures in Machine Learning

Mastering Data Analysis with Pandas: Filtering, Aggregating, and Grouping Data

Data is essential in today’s world, where businesses operate in increasingly data-driven environments. To make sense of data, people use tools like pandas, a popular library in Python for data analysis.

In this article, we will cover two topics: selecting rows in a pandas DataFrame based on a condition and creating a pandas DataFrame with sales information for various stores.

Selecting Rows in Pandas DataFrame Based on a Condition

Selecting rows in a pandas DataFrame based on a condition is a useful technique for filtering and extracting the data we want to analyze. Let us learn how to select rows whose value in a given column does not start with a specific string.

Syntax for Selecting Rows That Do Not Start with a Specific String:

To select rows whose value does not start with a particular string, we combine the “str.startswith()” function with the “~” operator, which negates the result. Here is an example:

import pandas as pd
# create a DataFrame
data = {"fruit": ["apple", "peach", "banana", "orange", "grape"], "price": [1.2, 2.1, 0.9, 1.5, 1.8]}
df = pd.DataFrame(data)
# select rows that do not start with the "o" string
df[~df['fruit'].str.startswith('o')]

In the above code example, we first created a DataFrame with a column “fruit” containing the values apple, peach, banana, orange, and grape. The “str.startswith()” function returns True for each fruit that starts with “o”, and the “~” operator inverts that result, so only the rows whose fruit does not start with “o” are selected.

The output of the code is:

    fruit  price
0   apple    1.2
1   peach    2.1
2  banana    0.9
4   grape    1.8
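
Real data often contains missing values, and a missing value in the column can leave a gap in the Boolean mask. The sketch below uses a hypothetical DataFrame with one missing fruit name and passes the “na” parameter of “str.startswith()” so that the mask holds only True or False before it is inverted. Treating a missing name as "does not start with o" is an assumption made for illustration; the right choice depends on your data.

```python
import pandas as pd

# Hypothetical data: one fruit name is missing
df = pd.DataFrame({
    "fruit": ["apple", None, "orange", "grape"],
    "price": [1.2, 2.1, 1.5, 1.8],
})

# na=False makes the missing name count as "does not start with 'o'",
# so the mask contains only True/False and can be inverted with ~
mask = df["fruit"].str.startswith("o", na=False)
result = df[~mask]
print(result)
```

Only the “orange” row is dropped here; the row with the missing name is kept.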

Example of Using the Syntax

Let us use the above syntax to select rows that do not start with a specific string.

import pandas as pd
# create a DataFrame
data = {"name": ["John", "Peter", "David", "Oliver", "Karen"], "age": [35, 26, 31, 42, 28]}
df = pd.DataFrame(data)
# select rows that do not start with the "O" string
df[~df['name'].str.startswith('O')]

In the above code example, we created a DataFrame with a column “name” containing the values John, Peter, David, Oliver, and Karen. We then selected the rows whose name does not start with “O” using the “~” operator and the “str.startswith()” function. Note that the comparison is case-sensitive, so a lowercase “o” would not match “Oliver”.

The output of the code is:

    name  age
0   John   35
1  Peter   26
2  David   31
4  Karen   28
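
Because “str.startswith()” is case-sensitive, names with inconsistent capitalization can slip through the filter. One possible workaround, sketched here with hypothetical mixed-case names, is to lower-case the column with “str.lower()” before testing the prefix. The original column is left unchanged.

```python
import pandas as pd

# Hypothetical data with inconsistent capitalization
df = pd.DataFrame({
    "name": ["John", "oliver", "Oliver", "Karen"],
    "age": [35, 26, 42, 28],
})

# Lower-case a temporary copy of the column, then test a lower-case prefix
starts_with_o = df["name"].str.lower().str.startswith("o")
result = df[~starts_with_o]
print(result)
```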

Pandas DataFrame with Sales Information for Various Stores

Creating a pandas DataFrame with sales information for stores is a crucial step for the data analysis of stores. Let us walk through the process of creating a pandas DataFrame with sales information for various stores.

Creating a Pandas DataFrame with Sales Information for Stores:

We can create a pandas DataFrame with sales information for stores using the following code:

import pandas as pd
# create a DataFrame
data = {"store_id": ["Store-01", "Store-02", "Store-03", "Store-04", "Store-05"],
        "location": ["New York", "Boston", "Washington D.C.", "Chicago", "Los Angeles"],
        "sales_2019": [50000, 55000, 40000, 60000, 80000],
        "sales_2020": [40000, 45000, 35000, 55000, 75000]}
df = pd.DataFrame(data)
# view the created DataFrame
df

In the above code example, we first created a DataFrame with columns “store_id”, “location”, “sales_2019”, and “sales_2020”. We then viewed the created DataFrame using the “df” variable.

The output of the code is:

   store_id         location  sales_2019  sales_2020
0  Store-01         New York       50000       40000
1  Store-02           Boston       55000       45000
2  Store-03  Washington D.C.       40000       35000
3  Store-04          Chicago       60000       55000
4  Store-05      Los Angeles       80000       75000
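
After creating a DataFrame, it can help to confirm that it has the shape and column types you expect before analyzing it. The following sketch rebuilds the same sales DataFrame and inspects it with the “shape” and “dtypes” attributes; it is a quick sanity check rather than a required step.

```python
import pandas as pd

data = {
    "store_id": ["Store-01", "Store-02", "Store-03", "Store-04", "Store-05"],
    "location": ["New York", "Boston", "Washington D.C.", "Chicago", "Los Angeles"],
    "sales_2019": [50000, 55000, 40000, 60000, 80000],
    "sales_2020": [40000, 45000, 35000, 55000, 75000],
}
df = pd.DataFrame(data)

# shape is a (rows, columns) tuple
print(df.shape)

# dtypes lists the data type pandas inferred for each column
print(df.dtypes)
```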

Conclusion

In conclusion, pandas is a powerful library in Python for data analysis. In this article, we covered two topics: selecting rows in pandas DataFrame based on a condition and creating a pandas DataFrame with sales information for various stores.

We hope that this article has helped you understand these topics and that you can apply them in your future data analysis projects.

Pandas is a popular data analysis library in Python, and being able to selectively extract data from DataFrames is a fundamental skill in data analysis. Selecting rows based on conditions is a common task, and in this article we review two topics: selecting rows based on a condition in a specific column and selecting rows with multiple conditions in a pandas DataFrame.

Selecting Rows Based on Condition in a Specific Column

Selecting rows based on a condition in a specific column is a powerful technique for filtering DataFrames to retrieve relevant data. To do this, you write a comparison on the column, which produces a Boolean Series, and use that Series to index the DataFrame.

Syntax for Selecting Rows Based on Condition in a Specific Column:

df[df['column_name'] condition]

The “condition” in the above syntax is a comparison that evaluates to a Boolean value for each row, such as “== 'Male'” or “> 18”. Rows where the result is True are kept. The following example shows how you can use this syntax to extract rows based on the values in a specific column.

Example of using the Syntax:

Suppose we have a DataFrame with three columns as shown below:

import pandas as pd
data = {
    'Name': ['Ada', 'Bob', 'Charlie', 'David'],
    'Age': [16, 18, 21, 25],
    'Gender': ['Female', 'Male', 'Male', 'Male']
}
df = pd.DataFrame(data)
print(df)

This creates a DataFrame that looks like this:

      Name  Age  Gender
0      Ada   16  Female
1      Bob   18    Male
2  Charlie   21    Male
3    David   25    Male

Now suppose we want to select rows where the age is greater than or equal to 21. We can achieve this using the following code:

age_condition = df['Age'] >= 21
df[age_condition]

This returns a subset DataFrame with rows that satisfy the condition, as shown below:

      Name  Age Gender
2  Charlie   21   Male
3    David   25   Male

The above code creates a Boolean condition “age_condition” that checks whether the age column has values greater than or equal to 21. Using this Boolean condition, we can select a subset of the DataFrame that satisfies the condition.
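
The same pattern works with other comparison operators. As a brief sketch on the same DataFrame, the following selects rows with an equality test on a text column and a less-than test on a numeric column; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ada", "Bob", "Charlie", "David"],
    "Age": [16, 18, 21, 25],
    "Gender": ["Female", "Male", "Male", "Male"],
})

# Equality test on a text column
females = df[df["Gender"] == "Female"]

# Less-than test on a numeric column
under_21 = df[df["Age"] < 21]

print(females)
print(under_21)
```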

Selecting Rows with Multiple Conditions in Pandas DataFrame

Another commonly used technique in data analysis is selecting rows with multiple conditions. In pandas, we combine conditions with the element-wise operators “&” (and) and “|” (or); Python’s “and” and “or” keywords do not work on a Series and raise an error.

Syntax for Selecting Rows with Multiple Conditions in Pandas DataFrame:

df[(condition1) & (condition2) & ... & (conditionN)]

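The above syntax selects the rows that satisfy all of the conditions, where “N” is the number of conditions you want to apply to the DataFrame. Each condition must be wrapped in parentheses because “&” binds more tightly than comparison operators such as “>=” and “==”.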

Example of using the syntax:

Suppose we have a dataset with four columns – “Name,” “Age,” “Gender,” and “City,” as shown below:

import pandas as pd
data = {
    'Name': ['Ada', 'Bob', 'Charlie', 'David'],
    'Age': [16, 18, 21, 25],
    'Gender': ['Female', 'Male', 'Male', 'Male'],
    'City': ['New York', 'Boston', 'Chicago', 'Los Angeles']
}
df = pd.DataFrame(data)
print(df)

This creates a DataFrame that looks like this:

      Name  Age  Gender          City
0      Ada   16  Female      New York
1      Bob   18    Male        Boston
2  Charlie   21    Male       Chicago
3    David   25    Male  Los Angeles

Now suppose we want to select rows where the age is greater than or equal to 21 AND the city is “Chicago”. We can achieve this using the following code:

age_condition = df['Age'] >= 21
city_condition = df['City'] == 'Chicago'
df[age_condition & city_condition]

This returns a subset DataFrame with rows that satisfy both conditions, as shown below:

      Name  Age Gender     City
2  Charlie   21   Male  Chicago

The above code creates two conditions, “age_condition” and “city_condition”, that check whether the Age column is greater than or equal to 21 and whether the City column equals “Chicago”. We then use the “&” operator to combine them, so only rows that satisfy both conditions are kept. With “Boston” in place of “Chicago”, the result would be empty, because the only person in Boston is 18.
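
Conditions can also be combined with “|” (or) and negated with “~” (not). The sketch below reuses the same four-row DataFrame; the specific conditions are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ada", "Bob", "Charlie", "David"],
    "Age": [16, 18, 21, 25],
    "Gender": ["Female", "Male", "Male", "Male"],
    "City": ["New York", "Boston", "Chicago", "Los Angeles"],
})

# OR: keep rows that satisfy at least one of the conditions
adult_or_boston = df[(df["Age"] >= 21) | (df["City"] == "Boston")]

# NOT: keep rows that do not satisfy the condition
not_male = df[~(df["Gender"] == "Male")]

print(adult_or_boston)
print(not_male)
```

As with “&”, the parentheses around each comparison are required.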

Conclusion

In this article, we reviewed two techniques for selecting rows from a pandas DataFrame: selecting rows based on a condition in a specific column and selecting rows with multiple conditions. By using these techniques, you can extract only the rows you need in your data analysis projects.

In data analysis, filtering and aggregation are common techniques used to analyze data and extract insights.

Pandas, a popular library in Python for data manipulation, provides several functions that make filtering and aggregation operations easier to perform. This article will cover two topics – filtering a pandas DataFrame based on multiple criteria and performing basic aggregation on a pandas DataFrame.

Filtering a Pandas DataFrame Based on Multiple Criteria

Filtering a pandas DataFrame based on multiple criteria is a technique that allows us to extract data that meets specific conditions. To keep only the rows that meet two or more conditions, we can use the “loc” indexer in pandas.

The “loc” indexer accesses a group of rows and columns by label(s) or by a Boolean array. Below is the syntax for filtering a pandas DataFrame based on multiple criteria.

Syntax for Filtering a Pandas DataFrame Based on Multiple Criteria:

df.loc[(df['column_name1'] condition1) & (df['column_name2'] condition2) & ... & (df['column_nameN'] conditionN)]

In the above syntax, “column_name” refers to the name of the column in the DataFrame, while “condition” is a Boolean expression that can be used to filter data based on different criteria.

Example of Using the Syntax:

Suppose we have a pandas DataFrame that contains information about different products, including the product name, price, and date. We want to filter the DataFrame to only include products that have a price greater than 50 and were sold in January 2021.

We can achieve this using the following code:

import pandas as pd
data = {
    'Product_Name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'Price': [70, 45, 80, 60],
    'Sold_Date': ['2021-01-01', '2021-02-05', '2021-01-15', '2021-01-20']
}
df = pd.DataFrame(data)
df['Sold_Date'] = pd.to_datetime(df['Sold_Date'])
filtered_data = df.loc[(df['Price'] > 50) & (df['Sold_Date'].dt.month == 1) & (df['Sold_Date'].dt.year == 2021)]

In the above code example, we created a DataFrame with three columns: Product_Name, Price, and Sold_Date. We converted Sold_Date to datetime values with “pd.to_datetime” so that the “.dt” accessor could extract the month and year, and then used the “loc” indexer to filter the DataFrame with multiple Boolean expressions.

We used the “&” operator to combine all the conditions and assigned the result to a new DataFrame called “filtered_data”, which contains Product A, Product C, and Product D.
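
When a filter has several criteria, naming each condition can make the combined expression easier to read. The following sketch applies the same price and date criteria, but expresses "January 2021" as a date range instead of separate month and year checks. The variable names “is_expensive” and “in_january_2021” are illustrative, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Name": ["Product A", "Product B", "Product C", "Product D"],
    "Price": [70, 45, 80, 60],
    "Sold_Date": pd.to_datetime(
        ["2021-01-01", "2021-02-05", "2021-01-15", "2021-01-20"]
    ),
})

# Name each condition so the combined filter stays readable
is_expensive = df["Price"] > 50
in_january_2021 = (df["Sold_Date"] >= "2021-01-01") & (df["Sold_Date"] < "2021-02-01")

filtered_data = df.loc[is_expensive & in_january_2021]
print(filtered_data)
```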

Performing Aggregation on a Pandas DataFrame

In data analysis, aggregation involves summarizing data to extract insights from it. In Pandas, you can perform basic aggregation functions like “sum,” “mean,” “min,” “max,” and “count” on a DataFrame or a specific column.

Syntax for Performing Basic Aggregation on a Pandas DataFrame:

df['column_name'].aggregation_function()

The above syntax shows how you can apply an aggregation function to a specific column in a pandas DataFrame.
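
As a minimal sketch of this syntax, the following applies each of the basic aggregation functions to a single hypothetical “Sales” column, without any grouping.

```python
import pandas as pd

# Hypothetical sales figures
df = pd.DataFrame({"Sales": [100, 200, 150, 50, 75, 100]})

total = df["Sales"].sum()      # sum of all values
average = df["Sales"].mean()   # arithmetic mean
smallest = df["Sales"].min()   # minimum value
largest = df["Sales"].max()    # maximum value
n_rows = df["Sales"].count()   # number of non-missing values

print(total, average, smallest, largest, n_rows)
```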

Example of Using the Syntax:

Suppose we have a pandas DataFrame that contains the sales data of different products across different stores. We want to perform basic aggregation on the data to get the total sales for each store.

We can achieve this using the following code:

import pandas as pd
data = {
    'Store': [1, 1, 2, 3, 3, 3],
    'Product': ['A', 'B', 'A', 'C', 'D', 'E'],
    'Sales': [100, 200, 150, 50, 75, 100]
}
df = pd.DataFrame(data)
store_sales = df.groupby('Store')['Sales'].sum()

In the above code example, we first created a DataFrame with three columns – Store, Product, and Sales. We then used the “groupby” function to group the data by the “Store” column.

Finally, we used the “sum” function to get the total sales for each store. The output of the “store_sales” variable is:

Store
1    300
2    150
3    225
Name: Sales, dtype: int64
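
If more than one statistic per store is needed, the grouped column can be passed to “agg” with a list of function names. This is a short sketch on the same data; the choice of “sum”, “mean”, and “count” is only an example.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 2, 3, 3, 3],
    "Product": ["A", "B", "A", "C", "D", "E"],
    "Sales": [100, 200, 150, 50, 75, 100],
})

# One row per store, one column per aggregation function
summary = df.groupby("Store")["Sales"].agg(["sum", "mean", "count"])
print(summary)
```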

Conclusion

In this article, we covered two techniques that are useful in data analysis: filtering a pandas DataFrame based on multiple criteria and performing basic aggregation on a pandas DataFrame. By using these techniques, you can extract relevant data and derive insights that can help in making informed decisions.

In data analysis, grouping is a technique that allows you to categorize data based on a specific criterion.

Grouping data in a pandas DataFrame is a powerful technique that enables you to perform further analysis based on the categories. In this article, we will cover the syntax for grouping data in a pandas DataFrame and an example of using the syntax.

Grouping Data in a Pandas DataFrame

Grouping data in a pandas DataFrame involves splitting the data into different groups based on a specific criterion or combination of criteria. The criterion could be a column value or a combination of multiple column values.

By grouping data, you can create subsets of the larger DataFrame and perform operations on each of these subsets. The “groupby” function in pandas is used for grouping data.

Syntax for Grouping Data in a Pandas DataFrame:

df.groupby('column_name')

In the above syntax, “column_name” is the column you want to group by. You can also group data by multiple columns by passing a list of column names, for example df.groupby(['column_name1', 'column_name2']).
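
To illustrate grouping by more than one column, the sketch below uses a small hypothetical dataset in which the same product appears more than once in a store. Passing a list to “groupby” produces one group per store and product combination.

```python
import pandas as pd

# Hypothetical data: Product A was sold twice in Store 1
df = pd.DataFrame({
    "Store_Name": ["Store 1", "Store 1", "Store 1", "Store 2"],
    "Product_Name": ["Product A", "Product A", "Product B", "Product A"],
    "Quantity_Sold": [10, 5, 8, 12],
})

# Group by store and product, then total the quantity in each group
quantity_by_store_product = (
    df.groupby(["Store_Name", "Product_Name"])["Quantity_Sold"].sum()
)
print(quantity_by_store_product)
```

The result is indexed by both grouping columns, so a single value can be looked up with a tuple such as ("Store 1", "Product A").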

Example of Using the Syntax:

Suppose we have a pandas DataFrame that contains data about different products sold in different stores, including the product name, store name, quantity sold, and price. We want to group the data by the store and get the total quantity sold for each store.

We can achieve this using the following code:

import pandas as pd
data = {
    'Product_Name': ['Product A', 'Product B', 'Product C', 'Product D'],
    'Store_Name': ['Store 1', 'Store 2', 'Store 1', 'Store 3'],
    'Quantity_Sold': [10, 5, 8, 12],
    'Price': [10, 15, 12, 8]
}
df = pd.DataFrame(data)
store_sales = df.groupby('Store_Name')['Quantity_Sold'].sum()
print(store_sales)

In the above code example, we first created a DataFrame with four columns – Product_Name, Store_Name, Quantity_Sold, and Price. We then used the “groupby” function to group the data by the “Store_Name” column.

We then selected the “Quantity_Sold” column and used the “sum” function to get the total quantity sold for each store. The output of the “store_sales” variable is:

Store_Name
Store 1    18
Store 2     5
Store 3    12
Name: Quantity_Sold, dtype: int64
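
The same grouping can be used to total sales revenue rather than quantity. The sketch below assumes that “Price” is the price per unit, which the data does not state explicitly, and adds a hypothetical “Revenue” column before grouping.

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Name": ["Product A", "Product B", "Product C", "Product D"],
    "Store_Name": ["Store 1", "Store 2", "Store 1", "Store 3"],
    "Quantity_Sold": [10, 5, 8, 12],
    "Price": [10, 15, 12, 8],
})

# Assumption: Price is per unit, so revenue = quantity * price
df["Revenue"] = df["Quantity_Sold"] * df["Price"]

store_revenue = df.groupby("Store_Name")["Revenue"].sum()
print(store_revenue)
```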

Conclusion

In this article, we covered the syntax for grouping data in a pandas DataFrame and an example of using the syntax. By using the “groupby” function, you can categorize your data into different groups and perform further analysis based on these groups.
