Filtering Pandas DataFrame Based on the Index
Pandas DataFrames are widely used for data manipulation and analysis in Python. They are tables with rows and columns, where each column is a variable and each row is an observation.
In Pandas, each row is identified by an index. The index can be numeric or non-numeric, and it can be used to filter the DataFrame.
In this article, we will discuss the different ways of filtering Pandas DataFrame based on the index.
1) Filtering Based on Numeric Index
A numeric index is a sequence of integers that starts from 0 and goes up to the number of rows in the DataFrame minus one.
To filter a DataFrame based on the numeric index, we can use the iloc
function. iloc
stands for integer location and it is used to locate a row or a column based on the integer position.
To keep only one row based on the index, we can use the iloc
function with the row number. For example, if we want to keep the third row of a DataFrame, we can use the following code:
df.iloc[2]
Here, 2 is the index of the third row because the index starts from 0.
The output will be a Series object that represents the third row of the DataFrame. To keep multiple rows based on the index, we can pass a list of integers to the iloc
function.
For example, if we want to keep the second and the fourth row of a DataFrame, we can use the following code:
df.iloc[[1, 3]]
Here, [1, 3]
is a list of the indices of the second and the fourth rows. The output will be a DataFrame that contains only the second and the fourth rows.
2) Filtering Based on Non-Numeric Index
A non-numeric index is a sequence of labels that can be strings, dates, or any other hashable object. To filter a DataFrame based on the non-numeric index, we can use the loc
function.
loc
stands for location and it is used to locate a row or a column based on the label. To keep a row based on the non-numeric index, we can use the loc
function with the label.
For example, if we have a DataFrame with a non-numeric index that represents dates, and we want to keep only the row that corresponds to a specific date, we can use the following code:
df.loc['2021-01-01']
Here, 2021-01-01
is the label of the row that corresponds to the date 2021-01-01
. The output will be a Series object that represents the row.
To keep multiple rows based on a specific string in the index, we can use the str
attribute of the index and the contains
method. For example, if we have a DataFrame with a non-numeric index that represents names, and we want to keep only the rows that contain the string John
, we can use the following code:
df.loc[df.index.str.contains('John')]
Here, df.index.str.contains('John')
returns a boolean series that represents the rows that contain the string John
.
The loc
function is used to filter the DataFrame based on this boolean series. The output will be a DataFrame that contains only the rows that contain the string John
.
Example of Filtering Pandas DataFrame Based on the Index:
To illustrate the different ways of filtering Pandas DataFrame based on the index, let’s create a sample DataFrame with numeric and non-numeric indices.
import pandas as pd
# Sample DataFrame with numeric index
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
df_numeric = pd.DataFrame(data, index=[0, 1, 2, 3])
# Sample DataFrame with non-numeric index
data = {'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]}
index = ['John Smith', 'Jessica Brown', 'John Doe', 'Mary Johnson']
df_non_numeric = pd.DataFrame(data, index=index)
Now, let’s filter these DataFrames based on the index.
Keeping Only One Row Based on Index:
To keep only one row based on the index of df_numeric
, we can use the iloc
function with the row number.
For example, to keep only the second row, we can use the following code:
df_numeric.iloc[1]
The output will be a Series object that represents the second row of the DataFrame. To keep only one row based on the index of df_non_numeric
, we can use the loc
function with the label.
For example, to keep only the row that corresponds to John Doe
, we can use the following code:
df_non_numeric.loc['John Doe']
The output will be a Series object that represents the row.
Keeping Multiple Rows Based on Index:
To keep multiple rows based on the index of df_numeric
, we can use the iloc
function with a list of row numbers.
For example, to keep the second and the fourth rows, we can use the following code:
df_numeric.iloc[[1, 3]]
The output will be a DataFrame that contains only the second and the fourth rows. To keep multiple rows based on the index of df_non_numeric
, we can use the loc
function with a boolean series that represents the rows that we want to keep.
For example, to keep the rows that contain the string John
in the index, we can use the following code:
df_non_numeric.loc[df_non_numeric.index.str.contains('John')]
The output will be a DataFrame that contains only the rows that contain the string John
in the index.
Conclusion:
Filtering Pandas DataFrame based on the index is a powerful technique that allows us to extract the rows that we need for our analysis.
We can filter based on the numeric index using the iloc
function, or based on the non-numeric index using the loc
function. By combining these functions with different arguments, we can keep only the rows that meet specific criteria, such as one row or multiple rows based on index values or label strings.
3) Syntax to Filter Pandas DataFrame Based on the Index
Filtering Pandas DataFrame based on the index is a useful technique for selecting a subset of rows based on specific index values or labels. In this section, we will discuss the syntax to filter Pandas DataFrame based on the index using the filter()
method with axis=0
.
The filter()
method is used to filter rows or columns based on specific criteria. The axis
parameter specifies whether the filtering should be done along rows (axis=0
) or columns (axis=1
).
When filtering rows based on the index, we use axis=0
. The syntax to filter Pandas DataFrame based on the index using the filter()
method with axis=0
is as follows:
df.filter(items=None, like=None, regex=None, axis=None)
The parameters of the filter()
method are:
items
: A list of index values or labels to keep.like
: A string to match against the index values or labels to keep.regex
: A regular expression to match against the index values or labels to keep.axis
: The axis along which to filter (0 for rows and 1 for columns).
To use the filter()
method with axis=0
to filter rows based on the index, we need to pass at least one of the parameters items
, like
, or regex
.
- To keep only specific index values or labels, we can use the
items
parameter. For example, to keep rows with index values 1 and 3, we can use the following code: - To keep only index values or labels that contain a specific substring, we can use the
like
parameter. - To keep only index values or labels that match a regular expression, we can use the
regex
parameter. For example, to keep rows with index values or labels that start with the string “J”, we can use the following code:
df.filter(items=[1, 3], axis=0)
For example, to keep rows with index values or labels that contain the string “John”, we can use the following code:
df.filter(like="John", axis=0)
df.filter(regex="^J", axis=0)
The regex "^J"
matches any string that starts with “J”.
4) Filtering Pandas DataFrame for Non-numeric Index
Filtering Pandas DataFrame based on the non-numeric index is a useful technique for selecting a subset of rows based on specific label values. In this section, we will provide an example of filtering Pandas DataFrame based on a non-numeric index using the filter()
method.
Let’s consider an example of a Pandas DataFrame with a non-numeric index that represents the names of individuals and their respective ages and salaries:
import pandas as pd
data = {'Age': [23, 40, 31, 27, 35],
'Salary': [50000, 75000, 100000, 60000, 80000]}
index = ['John Smith', 'Jessica Brown', 'John Doe', 'Mary Johnson', 'George Thompson']
df = pd.DataFrame(data, index=index)
print(df)
Output:
Age Salary
John Smith 23 50000
Jessica Brown 40 75000
John Doe 31 100000
Mary Johnson 27 60000
George Thompson 35 80000
Suppose we want to filter this DataFrame to include only the rows that represent individuals with the name “John”. We can achieve this using the filter()
method with the like
parameter:
df_filtered = df.filter(like='John', axis=0)
print(df_filtered)
Output:
Age Salary
John Smith 23 50000
John Doe 31 100000
The output only includes the rows with index labels that contain the string “John”.
Conclusion:
In this article, we discussed the different ways of filtering Pandas DataFrame based on the index.
We covered filtering based on numeric index using the iloc
function, and filtering based on non-numeric index using the loc
function. We also discussed the syntax to filter Pandas DataFrame based on the index using the filter()
method with axis=0
.
Finally, we provided an example of filtering Pandas DataFrame based on a non-numeric index using the filter()
method. By using these techniques, we can easily extract the rows that we need for our analysis, depending on the specific criteria that we define for the index.
5) Filtering Pandas DataFrame for Multiple Index Values that Contain Specific String
In addition to filtering Pandas DataFrame based on a single index value or label, we may need to filter based on multiple index values or labels that contain a specific string. In this section, we will provide an example of filtering Pandas DataFrame for multiple index values that contain a specific string.
Let’s consider an example of a Pandas DataFrame with a non-numeric index that represents the names of individuals and their respective ages and salaries:
import pandas as pd
data = {'Age': [23, 40, 31, 27, 35],
'Salary': [50000, 75000, 100000, 60000, 80000]}
index = ['John Smith', 'Jessica Brown', 'John Doe', 'Mary Johnson', 'George Thompson']
df = pd.DataFrame(data, index=index)
print(df)
Output:
Age Salary
John Smith 23 50000
Jessica Brown 40 75000
John Doe 31 100000
Mary Johnson 27 60000
George Thompson 35 80000
Suppose we want to filter this DataFrame to include only the rows that represent individuals whose names contain the string “John” or “Jessica”. We can achieve this using the filter()
method with the regex
parameter:
df_filtered = df.filter(regex="John|Jessica", axis=0)
print(df_filtered)
Output:
Age Salary
John Smith 23 50000
Jessica Brown 40 75000
John Doe 31 100000
The output only includes the rows with index labels that contain the strings “John” or “Jessica”. In this example, we used the regex
parameter of the filter()
method to specify a regular expression that matches any string that contains either “John” or “Jessica”.
The pipe character “|” is used to indicate a logical OR operation between the two strings.
6) Conclusion:
In this article, we discussed different techniques for filtering Pandas DataFrame based on the index.
We covered filtering based on numeric index and non-numeric index using the iloc
and loc
functions, respectively. Additionally, we showed how to use the filter()
method with the items
, like
, and regex
parameters to filter based on specific index values or labels.
Filtering Pandas DataFrame based on the index is a powerful tool for selecting a subset of rows based on specific criteria defined in the index. By using these techniques, we can easily extract the rows that we need for our analysis.
In this article, we discussed the different ways of filtering Pandas DataFrame based on the index, including filtering based on numeric and non-numeric index using the iloc
and loc
functions, respectively. We also covered the use of the filter()
method with the items
, like
, and regex
parameters to filter based on specific index values or labels.
Filtering Pandas DataFrame based on the index is a crucial technique for extracting the desired subset of rows based on specific criteria defined in the index. By using these techniques, we can easily filter the DataFrame to include only the desired rows for our analysis.
Overall, this article provides important techniques for data manipulation and analysis in Python that should be useful for anyone working with Pandas DataFrames.