Are you a data analyst or an aspiring data analyst looking to work with data using the powerful Python library, Pandas? If so, you’re in the right place.
In this article, we’ll explore two essential topics in Pandas that will make data analysis and manipulation a breeze – finding unique values and creating a Pandas DataFrame.
Finding Unique Values in Pandas and Ignoring NaN Values
When working with large datasets, it’s often necessary to identify unique values to gain insights into the data. However, missing (NaN) values can get in the way and call for special handling.
In this section, we’ll explore two examples of how to find unique values in Pandas while ignoring NaN values.
Custom Function for Finding Unique Values
Pandas provides a unique() method that returns the unique values in a Pandas series or column. However, this method does not ignore NaN values.
Suppose we have a Pandas dataframe as shown in the following example:
import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
We can see that the ‘score’ column has missing values: the None entries are stored as NaN once the DataFrame is created. To find unique values in the ‘score’ column while ignoring NaN values, we can use a custom function as follows:
def unique_vals_ignore_na(col):
    # keep only the unique values that are not NaN
    return [v for v in col.unique() if pd.notna(v)]
This function first finds the unique values in the column using the unique() method, then keeps only the entries for which pd.notna() returns True. Checking every entry matters because unique() preserves order of appearance, so a NaN can land anywhere in the result, not only at the front. Finally, the function returns the resulting list of unique values.
Example 1: Finding Unique Values in Pandas Column and Ignoring NaN Values
To find unique values in the ‘score’ column of the dataframe while ignoring NaN values using the custom function, we can do the following:
unique_scores = unique_vals_ignore_na(df['score'])
print(unique_scores)
This should output: `[85.0, 78.0, 92.0, 90.0, 75.0, 96.0]`
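If you’d rather not define a custom function, the same result can be obtained with built-in Pandas methods by chaining dropna() before unique(); a minimal sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# drop the NaN entries first, then take the unique values
unique_scores = df['score'].dropna().unique().tolist()
print(unique_scores)  # [85.0, 78.0, 92.0, 90.0, 75.0, 96.0]
```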
Example 2: Finding Unique Values in Pandas Groupby and Ignoring NaN Values
To find unique values in a pandas groupby object while ignoring NaN values using the custom function, we can use the following code snippet:
grouped = df.groupby('age')['score'].apply(unique_vals_ignore_na)
print(grouped)
This should output:
age
18          [75.0]
19          [78.0]
21          [92.0]
22    [85.0, 90.0]
23          [96.0]
Name: score, dtype: object
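Without a custom function, a similar per-group result can be obtained by dropping the NaN rows before grouping; a minimal sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# remove rows whose 'score' is NaN, then collect the unique scores per age
grouped = df.dropna(subset=['score']).groupby('age')['score'].unique()
print(grouped)
```

Note that with this approach a group whose scores are all NaN disappears entirely, whereas the custom function would keep it with an empty list.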
Creating a Pandas DataFrame
Creating a Pandas DataFrame is one of the first steps in working with Pandas for data analysis. We create DataFrames as a way of representing and manipulating data in tabular form.
In this section, we’ll explore how to create a Pandas DataFrame and view its contents.
Importing Required Libraries
Before creating a DataFrame, we may need to import the required libraries, namely Pandas and NumPy. Here’s a code snippet that shows how to import these libraries:
import pandas as pd
import numpy as np
Creating DataFrame
To create a Pandas DataFrame, we can use various methods. One of the easiest ways is to use a dictionary to define our data.
Here’s an example:
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
In this example, we define a dictionary data and use the pd.DataFrame() function to create a Pandas DataFrame from the dictionary. The resulting DataFrame will have columns as specified in the dictionary’s keys and values as specified in the dictionary’s values.
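A dictionary of lists is not the only route; a DataFrame can also be built from a list of row dictionaries or from a NumPy array. A quick sketch (the column names here are chosen just for illustration):

```python
import pandas as pd
import numpy as np

# from a list of row dictionaries: each dictionary becomes one row
rows = [{'name': 'Alex', 'age': 22}, {'name': 'Ben', 'age': 19}]
df_rows = pd.DataFrame(rows)

# from a NumPy array: column names are supplied separately
arr = np.array([[85, 78], [92, 90]])
df_arr = pd.DataFrame(arr, columns=['midterm', 'final'])

print(df_rows)
print(df_arr)
```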
Viewing DataFrame
To view the contents of a DataFrame, we can use the head() and tail() methods. The head() method returns the first n rows of the DataFrame, while the tail() method returns the last n rows; both default to five rows when n is not given.
print(df.head())
This should output:
name age score
0 Alex 22 85.0
1 Ben 19 78.0
2 Chris 21 92.0
3 Dave 22 90.0
4 Ella 18 75.0
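The tail() method works the same way from the other end of the DataFrame; for example, requesting the last three rows:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# tail(3) returns the last three rows; with no argument it returns five
print(df.tail(3))
```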
We’ve covered two essential topics in Pandas that will make your journey into data analysis much smoother. By learning how to find unique values while ignoring NaN values and how to create a Pandas DataFrame, you’re well on your way to manipulating and analyzing data with ease.
With practice, these concepts will become second nature to you, enabling you to unlock insights and make data-driven decisions.
3) Displaying Basic Information about a Pandas DataFrame
When working with a Pandas DataFrame, it’s often useful to display basic information about the data to obtain an understanding of its structure and content. In this section, we’ll explore how to display column names, data types, and shape of the DataFrame.
Displaying Column Names
To display the column names of a Pandas DataFrame, we can use the columns attribute. Here’s an example:
import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
print(df.columns)
This should output: `Index(['name', 'age', 'score'], dtype='object')`
The columns attribute returns a Pandas Index object that contains the column names.
Displaying Data Types
To display the data types of each column in a Pandas DataFrame, we can use the dtypes attribute. Here’s an example:
print(df.dtypes)
This should output:
name object
age int64
score float64
dtype: object
The dtypes attribute returns a Pandas series that contains the data types of each column.
Displaying Shape of DataFrame
To display the shape or dimensions of a Pandas DataFrame, we can use the shape attribute. Here’s an example:
print(df.shape)
This should output: `(8, 3)`
The shape attribute returns a tuple that contains the number of rows and columns in the DataFrame.
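For a combined overview, the info() method reports the column names, non-null counts, and data types in a single call; a quick sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# info() prints the index range, column names, non-null counts, and dtypes
df.info()
```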
4) Accessing Data in a Pandas DataFrame
Accessing data in a Pandas DataFrame is a critical part of data analysis. In this section, we’ll explore how to access rows and columns using indexing, iloc, and loc.
Accessing Rows and Columns Using Indexing
To access a specific row or column in a Pandas DataFrame using indexing, we can use the iloc attribute. Here’s an example:
import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
# accessing a specific column by index
print(df.iloc[:, 1])
This should output:
0 22
1 19
2 21
3 22
4 18
5 23
6 21
7 22
Name: age, dtype: int64
The iloc attribute allows us to access data using the numerical index of the row or column. In this example, we access the second column (index 1) using the syntax df.iloc[:, 1], which means accessing all rows and the second column.
Similarly, we can access a specific row using the index of its position as shown below:
# accessing a specific row by index
print(df.iloc[2, :])
This should output:
name Chris
age 21
score 92.0
Name: 2, dtype: object
Accessing Data Using iloc
To access specific rows and columns of a Pandas DataFrame using iloc, we can use the following syntax:
df.iloc[row_start:row_end, col_start:col_end]
This syntax allows us to specify a range of rows and columns to access; as in standard Python slicing, the end positions are excluded. Here’s an example:
# accessing a range of rows and columns using iloc
print(df.iloc[2:5, 0:2])
This should output:
name age
2 Chris 21
3 Dave 22
4 Ella 18
In this example, we accessed rows 2 through 4 and columns 0 through 1; the end positions (5 and 2) are excluded, just as in standard Python slicing.
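Because iloc follows standard Python slicing rules, negative positions count from the end of the DataFrame as well; for example:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# negative positions count from the end: the last two rows, all columns
print(df.iloc[-2:, :])
```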
Accessing Data Using loc
The loc attribute allows us to access data in a Pandas DataFrame using labels or names instead of numerical indexes. Here’s an example:
# accessing a specific row using loc
print(df.loc[3])
This should output:
name Dave
age 22
score 90.0
Name: 3, dtype: object
In this example, we accessed the row labeled 3. We can also access a specific column using loc as shown below:
# accessing a specific column using loc
print(df.loc[:, 'score'])
This should output:
0 85.0
1 78.0
2 92.0
3 90.0
4 75.0
5 96.0
6 NaN
7 NaN
Name: score, dtype: float64
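One difference from iloc worth remembering: because loc works with labels rather than positions, its slices include both endpoints. A quick sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# loc slices are label-based and inclusive: rows 2 through 4
# and columns 'name' through 'age' are all returned
subset = df.loc[2:4, 'name':'age']
print(subset)
```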
In this example, we accessed the ‘score’ column using the loc attribute.
In conclusion, accessing and displaying basic information in a Pandas DataFrame is essential in data analysis. Understanding how to access data using indexing, iloc, and loc, as well as displaying column names, data types, and shape, will enable you to manipulate and analyze data with ease. With practice, these concepts will become second nature, allowing you to unlock insights and make data-driven decisions.
5) Modifying Data in a Pandas DataFrame
Modifying data in a Pandas DataFrame is an essential part of data analysis. In this section, we’ll explore how to modify data in specific columns, add a new column to a DataFrame, and drop rows with NaN values.
Modifying Data in Specific Columns
To modify data in specific columns of a Pandas DataFrame, we can use the following syntax:
df['column_name'] = new_values
Here, we replace ‘column_name’ with the name of the column we wish to modify and ‘new_values’ with the new values that we want to assign to the column. Here’s an example:
import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
# modifying data in the 'score' column
df['score'] = [85, 78, 92, 90, 75, 96, 87, 81]
print(df)
This should output:
name age score
0 Alex 22 85
1 Ben 19 78
2 Chris 21 92
3 Dave 22 90
4 Ella 18 75
5 Fiona 23 96
6 George 21 87
7 Hannah 22 81
In this example, we modified the data in the ‘score’ column by assigning new values to the column.
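An assignment can also target only the rows that meet a condition by combining loc with a boolean mask. A sketch (the 0 used to fill the missing scores is an arbitrary choice for illustration):

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# modify only the rows where 'score' is missing, leaving the rest unchanged
df.loc[df['score'].isna(), 'score'] = 0
print(df)
```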
Adding a New Column to a DataFrame
Adding a new column to a Pandas DataFrame is easy. We can use the following syntax:
df['new_column_name'] = new_column_data
Here, we replace ‘new_column_name’ with the name of the column to be added and ‘new_column_data’ with the values that we want to assign to the new column.
Here’s an example:
# adding a new column to the DataFrame
df['grade'] = ['A', 'C', 'A', 'A', 'C', 'A', 'B', 'B']
print(df)
This should output:
name age score grade
0 Alex 22 85 A
1 Ben 19 78 C
2 Chris 21 92 A
3 Dave 22 90 A
4 Ella 18 75 C
5 Fiona 23 96 A
6 George 21 87 B
7 Hannah 22 81 B
In this example, we added a new column ‘grade’ to the DataFrame and assigned the appropriate values.
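A new column can also be computed from an existing one instead of typed out by hand; for instance, deriving a pass/fail flag from ‘score’ (the 80-point threshold here is arbitrary, chosen for illustration):

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave'],
        'score': [85, 78, 92, 90]}
df = pd.DataFrame(data)

# derive a new column with a vectorized comparison on an existing column
df['passed'] = df['score'] >= 80
print(df)
```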
Dropping Rows with NaN Values
Sometimes, we may wish to drop rows with NaN values from a Pandas DataFrame. To do this, we can use the following syntax:
df.dropna(axis=0, inplace=True)
Here, the dropna() method drops all rows that have at least one NaN value.
The axis parameter specifies that we want to drop rows (axis=0) with NaN values. The inplace parameter specifies that the DataFrame should be modified directly.
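The dropna() method also accepts a subset parameter to restrict the NaN check to particular columns, and without inplace=True it returns a new DataFrame, leaving the original untouched; a sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'George', 'Hannah'],
        'score': [85, 78, None, None]}
df = pd.DataFrame(data)

# subset limits the NaN check to the listed columns;
# without inplace=True, a new DataFrame is returned
clean = df.dropna(subset=['score'])
print(clean)
print(len(df))  # the original DataFrame still has all 4 rows
```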
6) Filtering Data in a Pandas DataFrame
Filtering data in a Pandas DataFrame is essential in data analysis. In this section, we’ll explore how to filter data based on values in a specific column, chain multiple filters together, and use lambda functions in filtering.
Filtering Data Based on Values in a Column
To filter a Pandas DataFrame based on values in a specific column, we can use the following syntax:
df[df['column_name'] == value]
Here, we replace ‘column_name’ with the name of the column we wish to filter and ‘value’ with the value that we want to filter by. Here’s an example:
import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)
# filtering data in the 'score' column based on a value
filtered_df = df[df['score'] >= 90]
print(filtered_df)
In this example, we kept only the rows whose ‘score’ is at least 90 and assigned the result to a new DataFrame called filtered_df. We can then print filtered_df to view the results.
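To match against several allowed values at once, the isin() method is usually more convenient than chaining equality checks; a sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)

# keep only the rows whose age is one of the listed values
filtered_df = df[df['age'].isin([18, 19])]
print(filtered_df)
```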
Chaining Multiple Filters
To chain multiple filters together, we can use the following syntax:
df[(df['column_name1'] == value1) & (df['column_name2'] == value2)]
Here, we replace ‘column_name1’ and ‘column_name2’ with the names of the columns we wish to filter and ‘value1’ and ‘value2’ with the values that we want to filter by. The parentheses around each comparison are required because the & operator binds more tightly than ==. Here’s an example:
# chaining multiple filters together
filtered_df = df[(df['age'] == 22) & (df['score'] >= 90)]
print(filtered_df)
In this example, we chained two filters together: one keeps rows where the ‘age’ column equals 22, and the other keeps rows where the ‘score’ column is at least 90.
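Conditions can also be combined with the | (or) operator or negated with ~, again with parentheses around each comparison; a sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)

# rows where age is 18 OR score is at least 95
filtered_df = df[(df['age'] == 18) | (df['score'] >= 95)]
print(filtered_df)
```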
Using Lambda Functions in Filtering
Lambda functions can also be used in filtering. Here’s an example:
# using lambda functions in filtering
filtered_df = df[df['age'].apply(lambda x: x > 20)]
print(filtered_df)
In this example, we used a lambda function to filter the ‘age’ column based on the condition that the age is greater than 20. Lambda functions are handy when the condition is too complex for a simple comparison, although for a check like this one the vectorized expression df['age'] > 20 is both shorter and faster.
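For readability, the same kind of filter can often be written with the query() method, which takes the condition as a string; a sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)

# query() evaluates the string expression against the column names
filtered_df = df.query('age > 20')
print(filtered_df)
```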
These are just a few examples of how to filter data in a Pandas DataFrame. The possibilities are endless, and with practice, you’ll be able to filter data with ease and gain valuable insights from your datasets.