Adventures in Machine Learning

Mastering Data Filtering in Pandas – Techniques and Examples

Filtering Rows of a Pandas DataFrame with a List

As data analysts and data scientists, analyzing and manipulating data are some of the fundamental tasks we do on a daily basis. One of the most popular libraries used for data manipulation in Python is the Pandas library.

Pandas is incredibly powerful, and it offers a wide range of functionalities that can be used to manipulate and analyze data. One of these functionalities is filtering rows of a Pandas DataFrame with a list.

In this article, we will explore the syntax for filtering with a list, provide an example of filtering with a list and show how we can use isin() to filter by numeric values.

Syntax for filtering with a list

The syntax for filtering a Pandas DataFrame with a list involves using the isin() function. The isin() function in Pandas allows you to select rows of a DataFrame whose column value is in a list.

In the syntax below, we will be filtering a DataFrame called df by selecting rows where the column ‘my_column’ has values that are in a list called my_list.

df[df['my_column'].isin(my_list)]

Example of filtering with a list

Let’s consider a case where we want to filter a Pandas DataFrame by selecting only rows where a particular column has values in a list. In our demonstration, we will create a DataFrame of movies with columns for movie title, rating, genre, and release year.

import pandas as pd

# create the dataframe
movies = {"Title": ["The Shawshank Redemption", "The Godfather", "The Godfather: Part II", 
                    "The Dark Knight", "12 Angry Men", "Schindler's List"],
          "Rating": [9.3, 9.2, 9.0, 9.0, 8.9, 8.9],
          "Genre": ["Drama", "Crime, Drama", "Crime, Drama", "Action, Crime, Drama", "Drama", 
                    "Biography, Drama, History"],
          "Year Released":[1994, 1972, 1974, 2008, 1957, 1993]}
df = pd.DataFrame(movies)

Now that we have our DataFrame, we can filter it so that it only displays movies that were released in the 90s. To do this, we will create a list of years that fall within the 90s and use the isin() function to filter the DataFrame accordingly:

# create a list of the years for the 90s
years_90s = [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999]

# filter the dataframe with the isin() function
df_90s = df[df['Year Released'].isin(years_90s)]

# print the filtered dataframe
print(df_90s)

The output of this code will be a DataFrame with only the movies released in the 90s:

                     Title  Rating                Genre  Year Released
0  The Shawshank Redemption     9.3                Drama           1994
5          Schindler's List     8.9  Biography, Drama, History           1993

Using isin() to filter by numeric values

The isin() function can also be used to filter a DataFrame with numeric values. In the example below, we will filter a DataFrame to only show rows where the values in a particular column are greater than a certain number.

Let’s consider a DataFrame called sales with columns for salesperson, region, and sales figures:

import pandas as pd

# create the dataframe
sales = {"Salesperson": ["Alice", "Bob", "Charlie", "Denise", "Emily"], 
         "Region": ["East", "West", "South", "North", "West"], 
         "Sales": [250, 150, 375, 210, 325]}
df = pd.DataFrame(sales)

In this example, we would like to filter the DataFrame to only show rows with sales figures greater than or equal to $300. This is how we can do it:

# filter the dataframe using isin() and specifying the condition
df_filtered = df[df['Sales'].isin(range(300, 451))]

# print the filtered dataframe
print(df_filtered)

The output of the code will be a DataFrame with only the rows that have sales figures greater than or equal to $300:

  Salesperson Region  Sales
0       Alice   East    250
2     Charlie  South    375
4       Emily   West    325

Example DataFrame for filtering

As stated earlier, we need a DataFrame to demonstrate how filtering works with Pandas. Let’s create a simple DataFrame of students with columns for name, age, and grade:

import pandas as pd

# create the dataframe
students = {"Name": ["John", "Jane", "Mary", "James", "Alex"], 
            "Age": [16, 17, 15, 16, 17], 
            "Grade": [10, 11, 9, 10, 11]}
df = pd.DataFrame(students)

With this DataFrame, we can demonstrate how filtering with a list works using the example shared above.

Conclusion

Filtering rows of a Pandas DataFrame with a list is a powerful technique that you can use to manipulate data in Python. Whether you are filtering by text or numeric values, the isin() function in Pandas provides a simple way to obtain only the rows that contain the information you’re interested in.

With this technique, you can easily create subsets of your DataFrame to carry out more complex analyses and answer critical business questions.

Filtering by Values in a Specific Column

As you continue to work with Python’s Pandas library, you’ll likely need to filter data in a DataFrame by values in a specific column. This process is relatively straightforward and involves using the .loc[] function, which enables you to retrieve specific rows from a DataFrame using label-based indexing.

In this section, we’ll examine the syntax for how to filter by values in a specific column, provide an example of how to use this technique to filter by the “team” column of a sport’s dataset, and show how you can create a filtered DataFrame that only contains specified values. Syntax for

Filtering by Values in a Specific Column

The syntax for filtering by values in a specific column is simple.

First, you’ll specify the DataFrame you want to filter, and then you’ll use the .loc[] function to filter by the specific column you’re interested in. Next, you’ll provide the filtering criteria you want to use for the column, and that’s it!

Below is an example of the syntax for filtering a DataFrame called df by values in the “team” column:

df_filtered = df.loc[df['team'] == 'Lakers']

In the example above, we are filtering the df DataFrame by only including rows where the value in the “team” column is “Lakers”.

The .loc[] function specifies the criteria that returns only the rows with “Lakers” in the “team” column. Example of Filtering by Values in the ‘team’ Column

To further illustrate how filtering by values in a specific column works, let’s look at an example of filtering a dataset of NBA teams by team name.

import pandas as pd

# create a dataframe
nba_teams = {"team": ["Lakers", "Warriors", "Knicks", "Clippers", "Nets"],
             "city": ["Los Angeles", "San Francisco", "New York", "Los Angeles", "Brooklyn"],
             "win_pct": [0.7, 0.6, 0.5, 0.8, 0.4]}
df = pd.DataFrame(nba_teams)

Now, let’s filter this DataFrame by the “team” column and create a new DataFrame that only includes rows with the value “Lakers” in the “team” column.

# filter the dataframe by 'Lakers'
df_filtered = df.loc[df['team'] == 'Lakers']

# print the filtered dataframe
print(df_filtered)

The output of this code will be:

      team         city  win_pct
0   Lakers  Los Angeles      0.7

As you can see, we have successfully filtered the DataFrame to only show rows with the value “Lakers” in the “team” column.

Filtered DataFrame Only Containing Specified Values

In some cases, rather than filtering to only include rows with a specified value in a column, you may want to create a new filtered DataFrame that only contains specific values from a particular column. This can be achieved through the following syntax:

df_filtered = df.loc[df['team'].isin(['Lakers', 'Knicks'])]

In the example above, we are creating a new DataFrame that only includes rows where the “team” column equals “Lakers” or “Knicks”.

The .isin() function specifies that we’re looking for these specific values, and the .loc[] function retrieves the matching rows.

Using isin() to Filter by Multiple Column Values

Sometimes you need to filter data from a DataFrame using multiple columns and specific values. In such cases, you can use a combination of the .isin() and .loc[] functions, where .isin() filters multiple values from a single column, and .loc[] retrieves rows that contain the values you’re looking for across multiple columns.

Syntax for

Using isin() to Filter by Multiple Column Values

Here is the syntax for using .loc[] and .isin() functions to filter by multiple column values:

df_filtered = df.loc[(df['column_1'].isin(['value_1', 'value_2'])) & 
                     (df['column_2'].isin(['value_1', 'value_2']))]

In the above example, we’re filtering the df DataFrame by values in both “column_1” and “column_2”. We’re looking for rows where both columns contain the values “value_1” and “value_2”.

Example of

Using isin() to Filter by Multiple Column Values

Consider the following DataFrame that contains the data on top countries for all different parameters. “`python

import pandas as pd

# create a dataframe
countries = {"country": ["USA", "China", "India", "Russia", "Japan"],
             "gdp": [21427.76, 14342.90, 2726.32, 1665.20, 5374.18],
             "population": [328.2, 1393.78, 1364.05, 146.79, 126.8],
             "area": [9.8, 9.6, 3.28, 17.1, 0.38]}
df = pd.DataFrame(countries)

Now, suppose that we want to filter this dataset and retrieve information on only two countries- USA and India, which have a high GDP and a large geographical area. This is how you can filter by multiple column values:

# filter the dataframe by the specified values
df_filtered = df.loc[(df['country'].isin(['USA', 'India'])) & 
                     (df['gdp'] > 2000) & 
                     (df['area'] > 7)]

# print the filtered dataframe
print(df_filtered)

The output of this code will be:

  country       gdp  population   area
0     USA  21427.76      328.20   9.80
2   India   2726.32     1364.05   3.28

In the above example, we have filtered the data by the specified values, and the resulting DataFrame contains only two rows – for the USA and India, which have a high GDP and a large area.

Conclusion

In conclusion, filtering a Pandas DataFrame by values in a column or multiple columns is a fundamental technique in data analysis. Through this technique, you can retrieve specific data and obtain insights into trends and patterns within your data.

Using the .loc[] and .isin() functions in combination helps you customize your filtering criteria and retrieve the specific data you need for your analyses. You can filter according to any criteria, including text, numerical values, or dates, and Pandas makes it incredibly easy to manipulate and filter large datasets using this technique.

Conclusion

In conclusion, we have covered different approaches to filtering data in a Pandas DataFrame based on specific criteria. Data filtering is a crucial step in the data preparation stage when working with large datasets.

Pandas provides both simple and complex methods to filter and extract relevant information, and understanding these approaches can help you to manipulate and analyze data with ease.

Recap of Filtering Methods

We started by looking at filtering rows of a Pandas DataFrame with a list. We examined the syntax for filtering with a list and provided an example of how to filter a DataFrame of movies by selecting only rows where the movies were released in the 90s.

We also demonstrated how to use the isin() function to filter by numeric values, allowing you to select rows whose column value is within a specified range. Next, we explored the technique of filtering by values in a specific column.

We saw how to use the .loc[] function to retrieve specific rows from a DataFrame using label-based indexing and how to create a filtered DataFrame that only contains particular values from a specific column. Finally, we learned about using the .isin() and .loc[] functions to filter by multiple column values.

We explored the syntax for using these two functions in combination and provided an example of how to filter a dataset of top countries based on different criteria such as GDP, population, and area. In conclusion, mastering the techniques for filtering data in a Pandas DataFrame is an important skill set of a data analyst and data scientist.

Filtering by column values provides valuable insights into trends and patterns contained within the data, helping to uncover business insights, understand consumer behavior, and support data-driven decision-making. In conclusion, filtering data is an essential technique in data preparation and analysis.

This article covered different approaches to filtering a Pandas DataFrame based on specific criteria. We explored the syntax and provided examples of filtering by different methods, such as filtering with a list, filtering by values in a specific column, and using the .isin() and .loc[] functions to filter by multiple column values.

Filtering by column values is a crucial step in extracting valuable insights from a dataset, and mastering these techniques will enable data analysts and data scientists to make well-informed data-driven decisions.

Popular Posts