Adventures in Machine Learning

Efficiently Manipulate and Visualize Data with Pandas: Tips and Tricks

Data manipulation is a crucial aspect of data analysis, and Pandas is an excellent tool for the job. In this article, we will explore two useful pieces of Pandas functionality: selecting the rows of a DataFrame that contain a specific substring, and creating a DataFrame.

1) Selecting rows that contain a specific substring in a Pandas DataFrame

1.1) Containing a specific substring

When working with large datasets in a Pandas DataFrame, it is essential to be able to filter specific rows based on a particular substring. For instance, if you are working with a dataset that contains a large number of customers’ names and need to select all the rows with a specific name or substring, you can achieve this using the Pandas str.contains() method.

import pandas as pd
data = pd.read_csv('customers.csv') #read the csv file
specific_name = 'John' #substring to be searched for
filtered_data = data[data['Name'].str.contains(specific_name)]

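The str.contains() method checks each value in the ‘Name’ column for the specific_name string and returns a boolean Series. Indexing the DataFrame with that Series keeps only the matching rows. The match is case-sensitive by default.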
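As a minimal, self-contained sketch of the same idea, the example below uses a small in-memory DataFrame in place of customers.csv (the sample names are hypothetical). It also shows two optional parameters of str.contains(): case=False for case-insensitive matching and na=False so that missing values count as non-matches.

```python
import pandas as pd

# Hypothetical sample data standing in for customers.csv
customers = pd.DataFrame({"Name": ["John Smith", "Mark Jones", "johnny Doe", None]})

# case=False ignores letter case; na=False treats missing names as non-matches
mask = customers["Name"].str.contains("John", case=False, na=False)
johns = customers[mask]
print(johns)
```

Without na=False, a missing name produces a missing value in the mask, which typically cannot be used for boolean indexing.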

1.2) Containing one substring OR another substring

Sometimes, we need to select rows that contain one substring OR another substring.

import pandas as pd
data = pd.read_csv('customers.csv') #read the csv file
substring_1 = 'John' #substring to be searched for
substring_2 = 'Mark' #substring to be searched for
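filtered_data = data[data['Name'].str.contains(substring_1 + '|' + substring_2)] #'John|Mark'

In this example, the two substrings are joined into the single pattern ‘John|Mark’. Because str.contains() treats its pattern as a regular expression by default, the | acts as OR, so the result keeps rows that contain either ‘John’ or ‘Mark’ in the ‘Name’ column. Applying | directly to two Python strings would raise a TypeError.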
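For more than two substrings, one possible approach is to build the OR pattern from a list. The sketch below assumes hypothetical sample names and uses re.escape() so that any regex special characters in the substrings are matched literally.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for customers.csv
customers = pd.DataFrame({"Name": ["John Smith", "Mark Jones", "Suzan Lee", "Sarah Mark"]})

substrings = ["John", "Mark"]
# Join the alternatives with "|" (regex OR); re.escape neutralizes special characters
pattern = "|".join(re.escape(s) for s in substrings)

either = customers[customers["Name"].str.contains(pattern)]
print(either)
```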

1.3) NOT containing given substrings

At times, you may want to exclude rows that contain specific substrings. An excellent way to accomplish this is by using the tilde (~) symbol along with str.contains().

import pandas as pd
data = pd.read_csv('customers.csv') #read the csv file
substring_1 = 'John' #substring to be excluded
substring_2 = 'Mark' #substring to be excluded
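filtered_data = data[~data['Name'].str.contains(substring_1 + '|' + substring_2)] #~ inverts the mask

Here, the pattern ‘John|Mark’ matches rows containing either ‘John’ or ‘Mark’ in the ‘Name’ column, and the tilde inverts the resulting boolean mask, so those rows are dropped and all other rows are kept.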
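The exclusion can be tried on a small in-memory DataFrame, as in the sketch below (the sample data is hypothetical). It passes na=False so the mask is strictly boolean before it is inverted. As a consequence, rows with a missing name land in the "does not contain" result.

```python
import pandas as pd

# Hypothetical sample data; the last name is missing on purpose
customers = pd.DataFrame({"Name": ["John Smith", "Mark Jones", "Suzan Lee", None]})

# na=False makes the mask strictly boolean, so "~" can invert it safely
has_name = customers["Name"].str.contains("John|Mark", na=False)
others = customers[~has_name]
print(others)
```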

1.4) Containing specific substring in the middle of a string

In some instances, you need to filter rows based on the middle of the string.

For instance, if a dataset contains email addresses, we may want to keep only those that have ‘gmail’ in the middle of the string. We can do this with regular expressions (regex) and the str.contains() method.

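import pandas as pd
data = pd.read_csv('customers.csv') #read the csv file
regex_pattern = '.+gmail.+' #'gmail' with at least one character before and after it
filtered_data = data[data['Email'].str.contains(regex_pattern, regex=True)]

This keeps only the email addresses in which ‘gmail’ has at least one character on each side, that is, where it appears in the middle of the string. Note that str.contains() already matches anywhere in a string, so a pattern such as ‘.*gmail.*’ behaves the same as plain ‘gmail’. The re module does not need to be imported, because Pandas handles the regex internally.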
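To illustrate how a plain substring search differs from a position-aware regex, the sketch below compares the two on hypothetical email addresses. The domain pattern is only one possible choice and assumes lowercase addresses.

```python
import pandas as pd

# Hypothetical sample addresses
contacts = pd.DataFrame(
    {"Email": ["ann@gmail.com", "gmail.fan@example.org", "bob@yahoo.com"]}
)

# Plain substring match: finds "gmail" anywhere, including before the "@"
anywhere = contacts[contacts["Email"].str.contains("gmail", regex=False)]

# Regex tied to the domain part: "@gmail." followed by the rest of the domain
domain_only = contacts[contacts["Email"].str.contains(r"@gmail\.[a-z.]+$", regex=True)]

print(anywhere)
print(domain_only)
```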

1.5) Containing a specific numeric value

Sometimes, we need to filter rows based on a specific numeric value.

In this case, we can use lambda functions that specify the condition to be checked.

import pandas as pd
data = pd.read_csv('customers.csv') #read the csv file
min_age = 18 #minimum age to be filtered
max_age = 40 #maximum age to be filtered
age_filtered_data = data[data.apply(lambda x: (min_age <= x['Age'] <= max_age), axis=1)]

In this example, the lambda is applied to each row (axis=1) and returns True when the age lies between 18 and 40, inclusive. Only those rows are kept.
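A row-wise apply() works, but on large DataFrames a vectorized comparison is usually faster. The sketch below shows one alternative using Series.between(), which is inclusive on both ends by default. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with ages on both sides of each boundary
people = pd.DataFrame({"Name": ["Ann", "Bob", "Cy", "Di"], "Age": [17, 18, 40, 41]})

min_age, max_age = 18, 40

# Series.between is inclusive on both ends by default
in_range = people[people["Age"].between(min_age, max_age)]

# Equivalent explicit form using two comparisons combined with "&"
same = people[(people["Age"] >= min_age) & (people["Age"] <= max_age)]

print(in_range)
```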

2) Example of a DataFrame

Creating a Pandas DataFrame is quite simple. We can define it from a Python dictionary in which each key becomes a column name and each list holds that column's values.

import pandas as pd
data_dict = {'Name':['John', 'Mark', 'Suzan', 'Sarah'], 'Age':[25, 30, 32, 28], 'Gender':['Male', 'Male', 'Female', 'Female']}

data = pd.DataFrame(data_dict)
print(data)

This code produces a table with one row per individual and columns for the name, age, and gender. No transpose is needed here. The .T attribute (it is not a method, so it takes no parentheses) would swap the rows and columns.
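To make the layout concrete, the sketch below builds the DataFrame from the same dictionary and compares its shape with that of its transpose. The variable names people and flipped are illustrative only.

```python
import pandas as pd

data_dict = {
    "Name": ["John", "Mark", "Suzan", "Sarah"],
    "Age": [25, 30, 32, 28],
    "Gender": ["Male", "Male", "Female", "Female"],
}

people = pd.DataFrame(data_dict)  # keys become columns, list items become rows
flipped = people.T                # .T is an attribute: rows and columns swap

print(people.shape)   # (4, 3)
print(flipped.shape)  # (3, 4)
```

In the untransposed frame each column holds one kind of value, which is the layout the filtering examples in section 1 rely on.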

Conclusion

Pandas is a powerful tool for manipulating large datasets. The article covered several ways to select rows: those that contain a specific substring, those that contain one substring or another, those that do not contain given substrings, those with a substring in the middle of a string, and those whose numeric value falls within a range.

It also showed how to create a DataFrame from a Python dictionary. With these techniques, we hope that you can now work with data more efficiently in your Python projects using Pandas.
