Adventures in Machine Learning

Pandas Data Analysis: Checking for Multiple Substrings in Strings

As data scientists and analysts, it is common to deal with large amounts of data, and one of the most useful tools for handling data in Python is the pandas library. Pandas provides a powerful data structure called DataFrame that allows us to organize and manipulate data in a tabular format.

In this article, we will explore how to use pandas to check whether a string in a DataFrame contains multiple substrings.

1. Methods to check if a string in a pandas DataFrame contains multiple substrings

When working with data in pandas, it is common to encounter situations where we need to check if a string contains one or more substrings. Fortunately, pandas provides us with different methods to perform this task.

1.1 Check if string contains one of several substrings

Suppose we have a DataFrame that contains company names, and we want to check if any of the company names contain the word “bank” or “tech”. We can use the “str.contains” method with a pattern that joins the two words with the “|” character.

```
import pandas as pd

data = {'Company': ['Bank of America', 'Microsoft', 'Google', 'JPMorgan Chase', 'Apple']}

df = pd.DataFrame(data)

# check if any company name contains the words "bank" or "tech"
df[df['Company'].str.contains('bank|tech', case=False)]
```

Here, the “contains” method treats its argument as a regular expression and searches each company name for it. Inside the pattern, “|” is the alternation character, which acts as a logical OR: a row is selected if either word is present in the string. With this sample data, only “Bank of America” is selected, since no company name contains “tech”.

Note that we set the “case” parameter to “False” to make the search case-insensitive. With the default, “True”, the search is case-sensitive, so the lowercase pattern “bank” would not match “Bank of America”.
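When the substrings come from a list rather than being typed by hand, the OR pattern can be built programmatically. The following is a minimal sketch, not the only way to do it: the `keywords` list is a hypothetical example, `re.escape` is used on the assumption that the terms should be matched literally rather than as regex syntax, and `na=False` is an optional choice that treats missing values as non-matches.

```python
import re

import pandas as pd

df = pd.DataFrame({'Company': ['Bank of America', 'Microsoft', 'Google', 'JPMorgan Chase', 'Apple']})

# Hypothetical list of search terms (an assumption for this sketch)
keywords = ['bank', 'soft']

# Escape each term so regex metacharacters are matched literally,
# then join the terms with "|" to form an OR pattern
pattern = '|'.join(re.escape(k) for k in keywords)

# case=False ignores case; na=False counts missing values as non-matches
mask = df['Company'].str.contains(pattern, case=False, na=False)
matches = df[mask]

print(matches)
```

With these terms, the mask selects the rows for “Bank of America” and “Microsoft”.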

1.2 Check if string contains multiple substrings

Now suppose we want to check if a string contains both the words “bank” and “tech”. We can call “contains” once per substring and combine the two results with the “&” operator.

```
import pandas as pd

data = {'Company': ['Bank of America', 'Microsoft', 'Google', 'JPMorgan Chase', 'Apple']}

df = pd.DataFrame(data)

# check if any company name contains the words "bank" and "tech"
df[df['Company'].str.contains('bank', case=False) & df['Company'].str.contains('tech', case=False)]
```

Here, we used two calls to the “contains” method, one for each substring, and combined the resulting boolean Series using the “&” operator, which performs an element-wise logical AND. The result is a DataFrame that only contains the rows where both substrings are present in the string. With this sample data the result is empty, because no company name contains both “bank” and “tech”.
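The same AND logic extends to any number of substrings by building one mask per substring and reducing them with “&”. The sketch below uses a hypothetical helper named `contains_all` (not a pandas function) and assumes the list of substrings is non-empty; `regex=False` is chosen here so that each substring is matched literally.

```python
import operator
from functools import reduce

import pandas as pd


def contains_all(series, substrings):
    """Return a boolean Series that is True where every substring occurs.

    Hypothetical helper for illustration; assumes `substrings` is non-empty.
    """
    # One boolean mask per substring (literal, case-insensitive match)
    masks = [
        series.str.contains(s, case=False, regex=False, na=False)
        for s in substrings
    ]
    # Combine all masks with element-wise AND
    return reduce(operator.and_, masks)


df = pd.DataFrame({'Company': ['Bank of America', 'Microsoft', 'Google', 'JPMorgan Chase', 'Apple']})

both = df[contains_all(df['Company'], ['bank', 'tech'])]
bank_america = df[contains_all(df['Company'], ['bank', 'america'])]

print(both.empty)
print(bank_america)
```

Here `both` is empty because no name contains “tech”, while `bank_america` holds the single row for “Bank of America”.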

2. Example of checking if a string in a pandas DataFrame contains multiple substrings

Now, let’s see an example where we put the previous methods into practice.

2.1 Creating a pandas DataFrame for the example

Suppose we are given a dataset that contains information about products sold by an online store. We want to select the products that belong to the “Electronics” category and have both “speaker” and “Bluetooth” in their descriptions.

We can create a DataFrame to simulate this data as follows:

```
import pandas as pd

data = {
    'Product': ['Sony SRS-XB33 speaker', 'iPhone 12 Pro', 'Samsung Galaxy Tab S7',
                'JBL Flip 5 Bluetooth speaker', 'Apple Watch SE'],
    'Price': [150, 999, 629, 119, 279],
    'Category': ['Electronics', 'Electronics', 'Tablet', 'Electronics', 'Electronics'],
    'Description': ['Experience powerful sound with the Sony SRS-XB33 speaker',
                    'The ultimate iPhone for power users',
                    'Unleash your creativity with the Samsung Galaxy Tab S7',
                    'Enhance your music experience with the JBL Flip 5 Bluetooth speaker',
                    'Stay connected with the Apple Watch SE'],
}

df = pd.DataFrame(data)
```

Our DataFrame has four columns: “Product”, “Price”, “Category”, and “Description”. The “Category” column contains the product categories, and the “Description” column contains a brief description of the products.

2.2 Checking if string contains one or multiple substrings with syntax examples

Let’s use the methods we learned in section 1 to select the products that meet our criteria.

```
# select products that belong to the "Electronics" category and have both "speaker" and "Bluetooth" in their descriptions
df[(df['Category'] == 'Electronics') & df['Description'].str.contains('speaker', case=False) & df['Description'].str.contains('Bluetooth', case=False)]
```

The above code builds a single boolean mask from three conditions joined with the “&” operator: the “Category” column equals “Electronics”, the “Description” column contains “speaker”, and the “Description” column contains “Bluetooth”. The parentheses around the equality comparison are required because “&” binds more tightly than “==” in Python. The search term is the singular “speaker” because that is the form used in the sample descriptions; the plural “speakers” would match no rows.

The result is a DataFrame that contains the selected products:

| | Product | Price | Category | Description |
|--:|:--|--:|:--|:--|
| 3 | JBL Flip 5 Bluetooth speaker | 119 | Electronics | Enhance your music experience with the JBL Flip 5 Bluetooth speaker |

The above result confirms that we correctly selected the only product that belongs to the “Electronics” category and has both “speaker” and “Bluetooth” in its description.
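For longer filters, it can help to give each condition a name before combining them. The following sketch rebuilds the sample data and is one possible arrangement rather than a required pattern; the variable names `is_electronics`, `has_speaker`, and `has_bluetooth` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Sony SRS-XB33 speaker', 'iPhone 12 Pro', 'Samsung Galaxy Tab S7',
                'JBL Flip 5 Bluetooth speaker', 'Apple Watch SE'],
    'Price': [150, 999, 629, 119, 279],
    'Category': ['Electronics', 'Electronics', 'Tablet', 'Electronics', 'Electronics'],
    'Description': ['Experience powerful sound with the Sony SRS-XB33 speaker',
                    'The ultimate iPhone for power users',
                    'Unleash your creativity with the Samsung Galaxy Tab S7',
                    'Enhance your music experience with the JBL Flip 5 Bluetooth speaker',
                    'Stay connected with the Apple Watch SE'],
})

# Name each condition so the final filter reads like the requirement
is_electronics = df['Category'] == 'Electronics'
has_speaker = df['Description'].str.contains('speaker', case=False)
has_bluetooth = df['Description'].str.contains('Bluetooth', case=False)

# Combine the masks with & and keep only the columns of interest
selected = df.loc[is_electronics & has_speaker & has_bluetooth, ['Product', 'Price']]

# The plural "speakers" does not occur in any sample description
plural_matches = df['Description'].str.contains('speakers', case=False).any()

print(selected)
print(plural_matches)
```

Only the row at index 3 passes all three conditions. The last check illustrates why the exact search term matters: “contains” looks for the substring as written, so “speakers” finds nothing in this data.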

Conclusion

Pandas provides powerful tools for manipulating data, and knowing how to use them effectively helps us extract insights quickly. In this article, we explored how to check if a string in a pandas DataFrame contains multiple substrings.

By combining the “contains” method with the “|” and “&” operators, we can filter a DataFrame to select rows that match any or all of several substrings. Whether the dataset is small or large, these methods fit naturally into a data analysis workflow and make it easy to filter data on multiple criteria.
