1) Introduction to pandas DataFrame
Pandas is a popular open-source data manipulation library for Python that offers versatile tools for data analysis, cleaning, and transformation. It provides labeled data structures and a consistent set of operations for working with tabular datasets efficiently.
1.1) What is a pandas DataFrame?
One of the most important building blocks of the Pandas library is the DataFrame, a two-dimensional table that stores data in rows and columns.
A pandas DataFrame is essentially a two-dimensional labeled array: every row and every column has a label. It is considered one of the most powerful tools in Pandas because it handles structured data with ease.
1.2) Components of a pandas DataFrame
The DataFrame can store various types of data such as integers, floating-point numbers, and text, and each column can have its own data type. The structure of a DataFrame is organized around two components: the row index and the column index.
The row index labels the rows and runs vertically, while the column index labels the columns and runs horizontally. Accessing data within a DataFrame is easy because both label-based and position-based indexing are supported.
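As a minimal sketch of these two components, the snippet below builds a tiny DataFrame and inspects its row index, its column index, and the data type of each column. The column names and values are made up for illustration.

```python
import pandas as pd

# Illustrative data; the names and values are assumptions for this sketch.
df = pd.DataFrame({"name": ["John", "Jane"], "age": [25, 30]})

row_labels = list(df.index)       # default row index: 0, 1, ...
column_labels = list(df.columns)  # column index: the column names
n_rows, n_cols = df.shape         # (number of rows, number of columns)

print(row_labels)
print(column_labels)
print(df.dtypes)                  # each column carries its own data type
```

Here the text column and the numeric column report different data types, which is what lets one table hold mixed kinds of data.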
2) Working with a pandas DataFrame
2.1) Creating a pandas DataFrame with sample data
Creating a pandas DataFrame is a straightforward process. One of the easiest ways to create a DataFrame in Pandas is to build it from a small set of sample data held in a Python dictionary.
Each dictionary key becomes a column name, and each list of values becomes the data for that column. The pd.DataFrame() constructor accepts such a dictionary directly. Here is an example of how to use it to create a DataFrame:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
print(df)
In this example, we have defined a dictionary called ‘data’ that contains three keys: ‘name’, ‘age’, and ‘gender’.
We have assigned lists as values to these keys. The pd.DataFrame() constructor is used to create a DataFrame object called ‘df’ that stores the data from the dictionary.
We then print the DataFrame to verify its contents.
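A dictionary of lists is only one possible input. As a hedged sketch, the constructor also accepts a list of dictionaries (one per row), and the index argument can supply custom row labels. The labels "a" and "b" and the variable name df_records are illustrative choices.

```python
import pandas as pd

# One dictionary per row; the keys become the column names.
records = [
    {"name": "John", "age": 25},
    {"name": "Jane", "age": 30},
]

# "a" and "b" are arbitrary row labels chosen for this sketch.
df_records = pd.DataFrame(records, index=["a", "b"])
print(df_records)
```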
2.2) Viewing a pandas DataFrame
Once a DataFrame is created, it is essential to view the data to confirm it is correct. Pandas provides several ways to view data within a DataFrame.
The most common are the .head() and .tail() methods, which show the first or last rows of a DataFrame, respectively. Here is an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
print(df.head(2))
Output:
name age gender
0 John 25 M
1 Jane 30 F
We have used the .head() method to view the first two rows of the DataFrame.
The .tail() method can be used in the same way but shows the last rows of the DataFrame. Called without an argument, both methods return five rows.
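For completeness, here is a short sketch of .tail() on the same sample data, together with the .shape attribute, which reports the size of the whole table without printing it. The variable name last_two is an illustrative choice.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

last_two = df.tail(2)  # the last two rows, original index labels preserved
print(last_two)
print(df.shape)        # (rows, columns) of the full DataFrame
```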
2.3) Accessing the components of a pandas DataFrame
DataFrames have two main components: columns and rows. Each column is a one-dimensional labeled array (a pandas Series) identified by its column name.
Each row holds one value from every column and is identified by its row index. In pandas, columns are accessed by column name and rows by row index.
2.3.1) Accessing columns
To access the column names of a DataFrame, we use the .columns attribute, which returns an Index object. Because it is an attribute rather than a method, it is written without parentheses. The Index object can be converted to a list to make the column names easier to work with.
Here is an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
print(list(df.columns))
Output:
['name', 'age', 'gender']
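Once the column names are known, they can be used to pull out the data itself. The sketch below, using the same sample data, shows that a single name in square brackets returns a Series, while a list of names returns a smaller DataFrame. The variable names ages and subset are illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

ages = df['age']              # single column name -> Series
subset = df[['name', 'age']]  # list of column names -> DataFrame

print(ages)
print(subset)
```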
2.3.2) Accessing rows
To access rows, Pandas provides two indexers, .loc and .iloc. The .loc indexer selects rows and columns by label, while the .iloc indexer selects them by integer position. Both are used with square brackets rather than parentheses.
Here is an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
print(df.loc[2])
Output:
name Mike
age 22
gender M
Name: 2, dtype: object
In this example, we have used the .loc indexer to access the row whose index label is 2. The output lists each column name next to that row's value. Because the default index is 0, 1, 2, and so on, label 2 is also the third row by position.
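The difference between label-based and position-based access is easier to see when the row labels are not integers. The following sketch assigns the hypothetical labels "a" through "d" to the same sample data and retrieves the same row both ways.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
# The string labels are an assumption made for this illustration.
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

by_label = df.loc['c']      # row whose label is 'c'
by_position = df.iloc[2]    # third row, counting from 0
cell = df.loc['c', 'age']   # a single value: row label, then column name
first_two = df.iloc[0:2]    # position slice; the end position is excluded

print(by_label)
print(cell)
print(first_two)
```

Both by_label and by_position refer to Mike's row; only the way of addressing it differs.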
2.4) Finding the Index with Maximum Value in a Pandas DataFrame
Pandas makes it easy to find the index of the maximum value in a DataFrame. Suppose you have a DataFrame containing data from a sporting event, and you want to know which athlete had the best performance and at which row index that result appears.
In this example, we will use the same DataFrame as before, with added data for each athlete’s scores.
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F'],
'score1': [70, 85, 90, 95],
'score2': [80, 90, 85, 89],
'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)
print(df)
Output:
name age gender score1 score2 score3
0 John 25 M 70 80 75
1 Jane 30 F 85 90 95
2 Mike 22 M 90 85 92
3 Sara 35 F 95 89 90
Example 1: Finding index that has max value for each column
To find the index that has the highest value for each column, we use the .idxmax() method.
This method returns, for each column, the row index of the first occurrence of the maximum value. Here is an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F'],
'score1': [70, 85, 90, 95],
'score2': [80, 90, 85, 89],
'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)
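# Keep only the numeric columns: a maximum is not meaningful for the text columns
print(df.select_dtypes(include='number').idxmax(axis=0, skipna=True))
Output:
age       3
score1    3
score2    1
score3    1
dtype: int64
In the above example, we have found the row index of the maximum value for each numeric column. The text columns ‘name’ and ‘gender’ are excluded with .select_dtypes(include='number'), because depending on the pandas version, calling .idxmax() on text columns can raise a TypeError.
The .idxmax() method has been called with axis=0 so that the search runs down the rows of each column. The skipna parameter is set to True, which is also its default, to exclude NaN values from the calculation.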
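The index returned by .idxmax() is most useful when it is fed back into .loc to look up the matching row. As a small sketch on the same sample data, the snippet below finds which athlete has the highest ‘score1’. The variable names best_row and best_name are illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'score1': [70, 85, 90, 95]}
df = pd.DataFrame(data)

best_row = df['score1'].idxmax()      # row label of the highest score1
best_name = df.loc[best_row, 'name']  # look up the athlete in that row

print(best_row, best_name)
```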
Example 2: Finding column that has max value for each row
To find the column that has the maximum value for each row, we use the .idxmax() method with the axis parameter set to 1. The method then returns, for each row, the name of the column that holds the maximum value.
Here is an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F'],
'score1': [70, 85, 90, 95],
'score2': [80, 90, 85, 89],
'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)
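# Compare only the numeric columns: text and numbers cannot be ranked against each other
print(df.select_dtypes(include='number').idxmax(axis=1, skipna=True))
Output:
0    score2
1    score3
2    score3
3    score1
dtype: object
In the above example, we have found the column with the maximum value for each row. The text columns ‘name’ and ‘gender’ are excluded with .select_dtypes(include='number'), because comparing strings with numbers within a row raises a TypeError.
The .idxmax() method has been called with axis=1 so that the search runs across the columns of each row. The skipna parameter is set to True, which is also its default, to exclude NaN values from the calculation.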
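A common follow-up is to store the row-wise result next to the data. The sketch below restricts the search to the three score columns and records both the best column and the best value for each athlete. The column names ‘best_round’ and ‘best_score’ are hypothetical names introduced for this illustration.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'score1': [70, 85, 90, 95],
        'score2': [80, 90, 85, 89],
        'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)

score_cols = ['score1', 'score2', 'score3']
df['best_round'] = df[score_cols].idxmax(axis=1)  # column name of each row's maximum
df['best_score'] = df[score_cols].max(axis=1)     # the maximum value itself

print(df[['name', 'best_round', 'best_score']])
```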
Conclusion:
Pandas is a powerful data manipulation tool that allows users to work with large datasets elegantly.
The DataFrame is a two-dimensional table that is the building block of Pandas. In this article, we have discussed how to create a Pandas DataFrame with sample data and view it using commonly used methods.
We have also explored how to access the components of a DataFrame, its columns and rows, and how to use the .idxmax() method to find the location of the maximum values within a DataFrame. We hope this article has been useful in helping you get started with Pandas and its DataFrame.
3) Manipulating Data in a pandas DataFrame
Data manipulation is an essential part of data analysis. Pandas provides a range of functions that allow users to manipulate data stored in a DataFrame.
In this section, we will explore several common data manipulation methods used in Pandas.
3.1) Adding a new column to a pandas DataFrame
Adding a new column to a pandas DataFrame is relatively simple. We can use the [] operator with a new column name and assign values to it.
Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df['score'] = [70, 85, 90, 95]
print(df)
Output:
name age gender score
0 John 25 M 70
1 Jane 30 F 85
2 Mike 22 M 90
3 Sara 35 F 95
In this example, we have created a new column called ‘score’ and assigned values to it using the [] operator. The list must contain one value per row.
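New columns are often computed from existing ones rather than typed in by hand. As a minimal sketch, the snippet below derives a boolean column from ‘score’ and uses .assign() to produce a copy with an extra column. The pass mark of 80, the bonus of 5, and the names ‘passed’ and ‘score_plus_bonus’ are arbitrary choices for illustration.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'score': [70, 85, 90, 95]}
df = pd.DataFrame(data)

# Derived column: True where the score reaches the (assumed) pass mark of 80.
df['passed'] = df['score'] >= 80

# .assign() returns a new DataFrame and leaves df itself unchanged.
with_bonus = df.assign(score_plus_bonus=df['score'] + 5)

print(df)
print(with_bonus)
```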
3.2) Dropping a column or row from a pandas DataFrame
Sometimes it is necessary to remove certain columns or rows from a DataFrame. We can do this using the .drop() method, which allows us to specify the columns or rows that we wish to remove.
Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df = df.drop('age', axis=1)
print(df)
Output:
name gender
0 John M
1 Jane F
2 Mike M
3 Sara F
In this example, we have used the .drop() method to remove the ‘age’ column from the DataFrame. We have used the axis parameter with a value of 1 to indicate that we want to remove a column. The method returns a new DataFrame, which is why the result is assigned back to df.
To remove a row, we can pass the index label of the row to the .drop() method with axis=0, which is also the default. Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df = df.drop(2, axis=0)
print(df)
Output:
name age gender
0 John 25 M
1 Jane 30 F
3 Sara 35 F
In this example, we have used the .drop() method to remove the row with an index of 2. The remaining rows keep their original index labels, so the index now reads 0, 1, 3.
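As an alternative to the axis parameter, .drop() also accepts the columns and index keywords, which some readers find easier to follow, and each keyword takes a list when several items should be removed at once. The sketch below uses the same sample data; the variable names trimmed and without_rows are illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

trimmed = df.drop(columns=['age', 'gender'])  # remove two columns at once
without_rows = df.drop(index=[0, 2])          # remove the rows labeled 0 and 2

print(trimmed)
print(without_rows)
print(df.shape)  # the original DataFrame is left untouched
```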
3.3) Renaming columns or index in a pandas DataFrame
Renaming columns or the index of a DataFrame is useful when we want to make the DataFrame more readable. We can use the .rename() method to rename columns or the index.
Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df = df.rename(columns={'name': 'full_name', 'age': 'age_years'})
print(df)
Output:
full_name age_years gender
0 John 25 M
1 Jane 30 F
2 Mike 22 M
3 Sara 35 F
In this example, we have used the .rename() method to rename the ‘name’ and ‘age’ columns to ‘full_name’ and ‘age_years’, respectively. We have passed a dictionary to the columns parameter that maps each old name to its new name. Columns not listed in the dictionary, such as ‘gender’, keep their names.
We can rename the index using the .rename() method as well. Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df = df.rename(index={0: 'row1', 1: 'row2', 2: 'row3', 3: 'row4'})
print(df)
Output:
name age gender
row1 John 25 M
row2 Jane 30 F
row3 Mike 22 M
row4 Sara 35 F
In this example, we have used the .rename() method to rename the index.
We have passed a dictionary to the index parameter that maps each old row label to its new name.
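When every label should change in the same way, writing out a dictionary becomes tedious. As a hedged sketch, .rename() also accepts a function, which is applied to each label in turn. Here the built-in str.upper is used, and the variable name upper_df is an illustrative choice.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# The function is called once per column name.
upper_df = df.rename(columns=str.upper)
print(upper_df)
```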
3.4) Applying functions to a pandas DataFrame
Applying functions to a DataFrame is another important way to manipulate data. We can use the .applymap() method to apply a function to each individual value in a DataFrame; the related .apply() method instead passes whole columns or rows to the function.
Here’s an example:
import pandas as pd
def double(x):
    # Numbers are multiplied by 2; strings are repeated twice
    return x * 2
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)
df = df.applymap(double)
print(df)
Output:
name age gender
0 JohnJohn 50 MM
1 JaneJane 60 FF
2 MikeMike 44 MM
3 SaraSara 70 FF
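In this example, we have defined a function called ‘double’ that doubles a given value. We have used the .applymap() method to apply this function to all values in the DataFrame, so the numbers are doubled and the strings are repeated. In pandas 2.1 and later, the same element-wise operation is available as DataFrame.map(), and .applymap() is deprecated in its favor.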
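To contrast element-wise application with .apply(), the sketch below uses a small numeric DataFrame (the values are made up for illustration). With .apply(), the function receives a whole column, or a whole row when axis=1, while Series.map() works value by value on a single column.

```python
import pandas as pd

# Illustrative scores; the values are assumptions for this sketch.
scores = pd.DataFrame({'score1': [70, 85], 'score2': [80, 90]})

col_means = scores.apply(lambda col: col.mean())          # one result per column
row_totals = scores.apply(lambda row: row.sum(), axis=1)  # one result per row
doubled = scores['score1'].map(lambda x: x * 2)           # element-wise on one column

print(col_means)
print(row_totals)
print(doubled)
```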
4) Filtering Data in a pandas DataFrame
Filtering data is another essential aspect of data manipulation. Pandas provides several ways to filter data in a DataFrame.
One of the most common ways to filter data is using boolean indexing. Boolean indexing involves creating a boolean Series with one True or False value per row of the DataFrame. This Series is used to select the rows that meet the specified criteria.
4.1) Filtering data using boolean indexing
Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F'],
'score1': [70, 85, 90, 95],
'score2': [80, 90, 85, 89],
'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)
filtered_df = df[df['age'] > 25]
print(filtered_df)
Output:
name age gender score1 score2 score3
1 Jane 30 F 85 90 95
3 Sara 35 F 95 89 90
In this example, we have used boolean indexing to filter the DataFrame based on the ‘age’ column. The expression df['age'] > 25 creates a boolean Series in which True marks rows where ‘age’ is greater than 25 and False marks rows where ‘age’ is less than or equal to 25. This Series is then used to select the rows that meet the criteria.
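Several conditions can be combined into one mask. The sketch below, on a reduced version of the sample data, uses & for "and" and | for "or". Each condition needs its own parentheses because these operators bind more tightly than comparisons. The variable names are illustrative.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# Both conditions must hold: older than 22 AND male.
older_men = df[(df['age'] > 22) & (df['gender'] == 'M')]

# At least one condition must hold: younger than 25 OR female.
young_or_female = df[(df['age'] < 25) | (df['gender'] == 'F')]

print(older_men)
print(young_or_female)
```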
4.2) Filtering data using the .query() method
Pandas also provides the .query() method for filtering data based on expressions. The .query() method selects rows in the same way as boolean indexing, but the condition is written as a string that refers to column names directly, which can make compound conditions easier to read.
Here’s an example:
import pandas as pd
data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
'age': [25, 30, 22, 35],
'gender': ['M', 'F', 'M', 'F'],
'score1': [70, 85, 90, 95],
'score2': [80, 90, 85, 89],
'score3': [75, 95, 92, 90]}
df = pd.DataFrame(data)
filtered_df = df.query('age > 25 and gender == "F"')
print(filtered_df)
Output:
name age gender score1 score2 score3
1 Jane 30 F 85 90 95
3 Sara 35 F 95 89 90
In this example, we have used the .query() method to filter the DataFrame based on the ‘age’ and ‘gender’ columns. The expression age > 25 and gender == "F" selects rows where ‘age’ is greater than 25 and ‘gender’ is equal to “F”.
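Thresholds are often held in Python variables rather than written into the expression. As a short sketch, .query() can refer to a local variable by prefixing its name with @. The variable name min_age is an illustrative choice.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'Mike', 'Sara'],
        'age': [25, 30, 22, 35],
        'gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

min_age = 25  # threshold kept outside the query string

# @min_age refers to the Python variable defined above.
result = df.query("age > @min_age and gender == 'F'")
print(result)
```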
Conclusion:
Filtering data is a crucial aspect of data analysis. Pandas provides several efficient methods for filtering data in a DataFrame, such as boolean indexing and the .query() method. We hope this article has been helpful in understanding these methods and how to apply them effectively.