Adventures in Machine Learning

Mastering Pandas: Essential Techniques for Data Analysis and Manipulation

Are you a data analyst or an aspiring data analyst looking to work with data using the powerful Python library, Pandas? If so, you’re in the right place.

In this article, we’ll explore two essential topics in Pandas that will make data analysis and manipulation a breeze – finding unique values and creating a Pandas DataFrame.

Finding Unique Values in Pandas and Ignoring NaN Values

When working with large datasets, it’s often necessary to identify unique values to gain insights into the data. However, missing or NaN values can often pose a challenge that requires specialized solutions.

In this section, we’ll explore two examples of how to find unique values in Pandas while ignoring NaN values.

Custom Function for Finding Unique Values

Pandas provides a unique() method that returns the unique values in a Series or DataFrame column. However, this method does not ignore NaN values: if the column contains missing data, NaN appears in the result as one of the unique values.

Suppose we have a Pandas dataframe as shown in the following example:

```
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}

df = pd.DataFrame(data)
```

The 'score' column contains NaN values, because Pandas converts the None entries to NaN when it builds this numeric column. To find unique values in the 'score' column while ignoring NaN values, we can use a custom function as follows:

```
def unique_vals_ignore_na(col):
    # Drop NaN entries first, then return the remaining unique values as a list
    return col.dropna().unique().tolist()
```

This function first removes the missing values from the column with the dropna() method. It then finds the unique values among what remains with unique() and converts the result to a list with tolist().

Dropping NaN before calling unique() matters because unique() returns values in order of first appearance, so a NaN can sit anywhere in the result rather than only at the start.

Example 1: Finding Unique Values in Pandas Column and Ignoring NaN Values

To find unique values in the ‘score’ column of the dataframe while ignoring NaN values using the custom function, we can do the following:

```
unique_scores = unique_vals_ignore_na(df['score'])

print(unique_scores)
```

This should output: `[85.0, 78.0, 92.0, 90.0, 75.0, 96.0]`
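The same result can also be reached with built-in methods alone. The following is a minimal sketch, assuming a standalone Series named `scores` that mirrors the 'score' column; the variable names are illustrative and not part of the original example.

```python
import pandas as pd

# Hypothetical standalone copy of the 'score' column
scores = pd.Series([85, 78, 92, 90, 75, 96, None, None], name="score")

# unique() keeps NaN as one of the distinct values
with_nan = scores.unique()

# Dropping NaN first leaves only the real values
without_nan = scores.dropna().unique().tolist()

# nunique() skips NaN by default; dropna=False counts NaN as a value
count_without_nan = scores.nunique()
count_with_nan = scores.nunique(dropna=False)

print(with_nan)
print(without_nan)
print(count_without_nan, count_with_nan)
```

When only the number of distinct values is needed, nunique() avoids building the list at all.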

Example 2: Finding Unique Values in Pandas Groupby and Ignoring NaN Values

To find unique values in a pandas groupby object while ignoring NaN values using the custom function, we can use the following code snippet:

```
grouped = df.groupby('age')['score'].apply(unique_vals_ignore_na)

print(grouped)
```

This should output:

```
age
18          [75.0]
19          [78.0]
21          [92.0]
22    [85.0, 90.0]
23          [96.0]
Name: score, dtype: object
```
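A grouped result like this can also be produced without a custom function. The sketch below is one possible approach, assuming it is acceptable to drop rows with a missing score before grouping; the names `unique_by_age` and `counts_by_age` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 19, 21, 22, 18, 23, 21, 22],
    "score": [85, 78, 92, 90, 75, 96, None, None],
})

# Remove rows with a missing score, then collect the unique scores per age
unique_by_age = df.dropna(subset=["score"]).groupby("age")["score"].unique()

# nunique() ignores NaN by default, so counts need no pre-filtering
counts_by_age = df.groupby("age")["score"].nunique()

print(unique_by_age)
print(counts_by_age)
```

One design consequence: a group whose scores are all missing disappears from `unique_by_age` because its rows are dropped before grouping, while `counts_by_age` would still list it with a count of 0.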

Creating a Pandas DataFrame

Creating a Pandas DataFrame is one of the first steps in working with Pandas for data analysis. We create DataFrames as a way of representing and manipulating data in tabular form.

In this section, we’ll explore how to create a Pandas DataFrame and view its contents.

Importing Required Libraries

Before creating a DataFrame, we need to import Pandas. NumPy is not required for the examples in this section, but it is commonly imported alongside Pandas. Here's a code snippet that shows how to import these libraries:

```
import pandas as pd
import numpy as np
```

Creating DataFrame

To create a Pandas DataFrame, we can use various methods. One of the easiest ways is to use a dictionary to define our data.

Here’s an example:

```
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}

df = pd.DataFrame(data)
```

In this example, we define a dictionary data and use the pd.DataFrame() function to create a Pandas DataFrame from the dictionary. The resulting DataFrame will have columns as specified in the dictionary’s keys and values as specified in the dictionary’s values.
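A dictionary of columns is not the only input pd.DataFrame() accepts. As a hedged illustration, the sketch below builds a small DataFrame from a list of row dictionaries and uses np.nan for a missing score; the three records and the names `records`, `df_from_records`, and `df_ordered` are made up for this example.

```python
import numpy as np
import pandas as pd

# Each dictionary describes one row (illustrative subset of the data)
records = [
    {"name": "Alex", "age": 22, "score": 85},
    {"name": "Ben", "age": 19, "score": 78},
    {"name": "George", "age": 21, "score": np.nan},
]

# Column names are taken from the dictionary keys
df_from_records = pd.DataFrame(records)

# The columns argument selects and orders the columns explicitly
df_ordered = pd.DataFrame(records, columns=["name", "score", "age"])

print(df_from_records)
print(df_ordered)
```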

Viewing DataFrame

To view the contents of a DataFrame, we can use the head() and tail() methods. The head() method returns the first n rows of the DataFrame, while the tail() method returns the last n rows. Both default to n=5.

```
print(df.head())
```

This should output:

```
    name  age  score
0   Alex   22   85.0
1    Ben   19   78.0
2  Chris   21   92.0
3   Dave   22   90.0
4   Ella   18   75.0
```
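Both methods also take an explicit row count. A minimal sketch, assuming the same eight-row data as above; `first_three` and `last_two` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Pass n to control how many rows come back
first_three = df.head(3)
last_two = df.tail(2)

print(first_three)
print(last_two)
```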

We’ve covered two essential topics in Pandas that will make your journey into data analysis much smoother. By learning how to find unique values while ignoring NaN values and how to create a Pandas DataFrame, you’re well on your way to manipulating and analyzing data with ease.

With practice, these concepts will become second nature to you, enabling you to unlock insights and make data-driven decisions.

3) Displaying Basic Information about a Pandas DataFrame

When working with a Pandas DataFrame, it’s often useful to display basic information about the data to obtain an understanding of its structure and content. In this section, we’ll explore how to display column names, data types, and shape of the DataFrame.

Displaying Column Names

To display the column names of a Pandas DataFrame, we can use the columns attribute. Here’s an example:

```
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}

df = pd.DataFrame(data)

print(df.columns)
```

This should output: `Index(['name', 'age', 'score'], dtype='object')`

The columns attribute returns a Pandas Index object that contains the column names.
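Because the result is an Index rather than a plain list, a few follow-up operations are often handy. The sketch below shows some of them under the assumption that the same sample data is in use; `column_names`, `has_score`, and `renamed` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Convert the Index to an ordinary Python list
column_names = df.columns.tolist()

# Membership tests work directly on the Index
has_score = "score" in df.columns

# rename() returns a new DataFrame and leaves df unchanged
renamed = df.rename(columns={"score": "exam_score"})

print(column_names, has_score)
print(renamed.columns.tolist())
```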

Displaying Data Types

To display the data types of each column in a Pandas DataFrame, we can use the dtypes attribute. Here’s an example:

```
print(df.dtypes)
```

This should output:

```
name      object
age        int64
score    float64
dtype: object
```

The dtypes attribute returns a Pandas series that contains the data types of each column.
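Data types can also drive column selection and conversion. A minimal sketch, assuming the sample data above; `numeric_cols` and `age_as_float` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Keep only the numeric columns (integer and float)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# astype() returns a converted copy of the column
age_as_float = df["age"].astype("float64")

print(numeric_cols)
print(age_as_float.dtype)
```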

Displaying Shape of DataFrame

To display the shape or dimensions of a Pandas DataFrame, we can use the shape attribute. Here’s an example:

```
print(df.shape)
```

This should output: `(8, 3)`

The shape attribute returns a tuple that contains the number of rows and columns in the DataFrame.
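Since shape is an ordinary tuple, it can be unpacked and combined with other quick checks. The following sketch assumes the same sample data; `n_rows`, `n_cols`, and `missing_per_column` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Unpack the (rows, columns) tuple
n_rows, n_cols = df.shape

# Count missing values in each column
missing_per_column = df.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```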

4) Accessing Data in a Pandas DataFrame

Accessing data in a Pandas DataFrame is a critical part of data analysis. In this section, we’ll explore how to access rows and columns using indexing, iloc, and loc.

Accessing Rows and Columns Using Indexing

To access a specific row or column in a Pandas DataFrame by its integer position, we can use the iloc indexer. Here's an example:

```
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}

df = pd.DataFrame(data)

# accessing a specific column by integer position
print(df.iloc[:, 1])
```

This should output:

```
0    22
1    19
2    21
3    22
4    18
5    23
6    21
7    22
Name: age, dtype: int64
```

The iloc attribute allows us to access data using the numerical index of the row or column. In this example, we access the second column (index 1) using the syntax df.iloc[:, 1], which means accessing all rows and the second column.

Similarly, we can access a specific row using the index of its position as shown below:

```
# accessing a specific row by integer position
print(df.iloc[2, :])
```

This should output:

```
name     Chris
age         21
score     92.0
Name: 2, dtype: object
```
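Beyond whole rows and columns, iloc also accepts single positions, negative positions, and lists of positions. A minimal sketch, assuming the same sample data; `cell`, `last_row`, and `picked` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# A single cell: row position 2, column position 0
cell = df.iloc[2, 0]

# Negative positions count from the end, as in Python lists
last_row = df.iloc[-1]

# Lists select arbitrary rows and columns by position
picked = df.iloc[[0, 3], [0, 2]]

print(cell)
print(last_row)
print(picked)
```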

Accessing Data Using iloc

To access specific rows and columns of a Pandas DataFrame using iloc, we can use the following syntax:

```
df.iloc[row_start:row_end, col_start:col_end]
```

This syntax allows us to specify a range of rows and columns to access. Here’s an example:

```
# accessing a range of rows and columns using iloc
print(df.iloc[2:5, 0:2])
```

This should output:

```
    name  age
2  Chris   21
3   Dave   22
4   Ella   18
```

In this example, we accessed rows 2 to 4 and columns 0 to 1. As with Python slices, the end position is excluded, so 2:5 selects the rows at positions 2, 3, and 4, and 0:2 selects the first two columns.
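Slicing behaves differently with iloc and loc, which is a common source of off-by-one mistakes. The sketch below compares the two on the same sample data; `by_position` and `by_label` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# iloc slices by position and excludes the end: positions 2, 3, 4
by_position = df.iloc[2:5, 0:2]

# loc slices by label and includes the end: labels 2, 3, 4, 5
by_label = df.loc[2:5, "name":"age"]

print(by_position)
print(by_label)
```

With the default integer index, labels and positions happen to coincide, so the extra row in the loc result is the only visible difference.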

Accessing Data Using loc

The loc attribute allows us to access data in a Pandas DataFrame using labels or names instead of numerical indexes. Here’s an example:

```
# accessing a specific row using loc
print(df.loc[3])
```

This should output:

```
name     Dave
age        22
score    90.0
Name: 3, dtype: object
```

In this example, we accessed the row labeled 3. We can also access a specific column using loc as shown below:

```
# accessing a specific column using loc
print(df.loc[:, 'score'])
```

This should output:

```
0    85.0
1    78.0
2    92.0
3    90.0
4    75.0
5    96.0
6     NaN
7     NaN
Name: score, dtype: float64
```

In this example, we accessed the ‘score’ column using the loc attribute. In conclusion, accessing and displaying basic information in a Pandas DataFrame is essential in data analysis.

Understanding how to access data using indexing, iloc, and loc, as well as displaying column names, data types, and shape, will enable you to manipulate and analyze data with ease. With practice, these concepts will become second nature, allowing you to unlock insights and make data-driven decisions.
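As one further illustration of label-based access, loc becomes more expressive once the index carries meaningful labels. This is a minimal sketch, assuming the 'name' column is a suitable index for the sample data; `indexed`, `dave_score`, `subset`, and `older` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Use the names as row labels
indexed = df.set_index("name")

# A single cell by row label and column label
dave_score = indexed.loc["Dave", "score"]

# Several rows and columns by label
subset = indexed.loc[["Alex", "Ella"], ["age"]]

# loc also accepts a boolean mask for the rows
older = df.loc[df["age"] > 21, ["name", "age"]]

print(dave_score)
print(subset)
print(older)
```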

5) Modifying Data in a Pandas DataFrame

Modifying data in a Pandas DataFrame is an essential part of data analysis. In this section, we’ll explore how to modify data in specific columns, add a new column to a DataFrame, and drop rows with NaN values.

Modifying Data in Specific Columns

To modify data in specific columns of a Pandas DataFrame, we can use the following syntax:

```
df['column_name'] = new_values
```

Here, we replace ‘column_name’ with the name of the column we wish to modify and ‘new_values’ with the new values that we want to assign to the column. Here’s an example:

```
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}

df = pd.DataFrame(data)

# modifying data in the 'score' column
df['score'] = [85, 78, 92, 90, 75, 96, 87, 81]

print(df)
```

This should output:

```
     name  age  score
0    Alex   22     85
1     Ben   19     78
2   Chris   21     92
3    Dave   22     90
4    Ella   18     75
5   Fiona   23     96
6  George   21     87
7  Hannah   22     81
```

In this example, we modified the data in the ‘score’ column by assigning new values to the column.
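Replacing a whole column is not always necessary. The sketch below shows two narrower updates, starting again from the data with missing scores: filling the gaps with the mean of the observed scores, and changing one cell through loc. Filling with the mean and the replacement value 80.0 are assumptions chosen purely for illustration.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# mean() skips NaN, so this is the average of the six observed scores
mean_score = df["score"].mean()

# Fill only the missing entries; existing scores are left alone
df["score"] = df["score"].fillna(mean_score)

# Update a single cell selected by a condition on another column
df.loc[df["name"] == "Ben", "score"] = 80.0

print(df)
```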

Adding a New Column to a DataFrame

Adding a new column to a Pandas DataFrame is easy. We can use the following syntax:

```
df['new_column_name'] = new_column_data
```

Here, we replace ‘new_column_name’ with the name of the column to be added and ‘new_column_data’ with the values that we want to assign to the new column.

Here’s an example:

```
# adding a new column to the DataFrame
df['grade'] = ['A', 'C', 'A', 'A', 'C', 'A', 'B', 'B']

print(df)
```

This should output:

```
     name  age  score grade
0    Alex   22     85     A
1     Ben   19     78     C
2   Chris   21     92     A
3    Dave   22     90     A
4    Ella   18     75     C
5   Fiona   23     96     A
6  George   21     87     B
7  Hannah   22     81     B
```

In this example, we added a new column ‘grade’ to the DataFrame and assigned the appropriate values.
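New columns are often computed from existing ones rather than typed in by hand. A minimal sketch, assuming the filled-in scores from the earlier example; the thresholds (80 and 90) and the column names 'passed' and 'score_band' are hypothetical and not part of the original grading.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
    "score": [85, 78, 92, 90, 75, 96, 87, 81],
})

# A boolean column from a comparison (hypothetical pass mark of 80)
df["passed"] = df["score"] >= 80

# np.where picks one of two labels per row (hypothetical cutoff of 90)
df["score_band"] = np.where(df["score"] >= 90, "high", "standard")

print(df)
```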

Dropping Rows with NaN Values

Sometimes, we may wish to drop rows with NaN values from a Pandas DataFrame. To do this, we can use the following syntax:

```
df.dropna(axis=0, inplace=True)
```

Here, the dropna() method drops all rows that have at least one NaN value.

The axis parameter specifies that we want to drop rows (axis=0) with NaN values. The inplace parameter specifies that the DataFrame should be modified directly.
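When the original DataFrame should stay intact, dropna() can return a new object instead. The sketch below assumes the sample data with two missing scores and limits the check to the 'score' column through the subset parameter; `cleaned` and `cleaned_reset` are illustrative names.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Without inplace=True, dropna() returns a new DataFrame
cleaned = df.dropna(subset=["score"])

# Renumber the rows so the index has no gaps
cleaned_reset = cleaned.reset_index(drop=True)

print(len(df), len(cleaned))
print(cleaned_reset)
```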

6) Filtering Data in a Pandas DataFrame

Filtering data in a Pandas DataFrame is essential in data analysis. In this section, we’ll explore how to filter data based on values in a specific column, chain multiple filters together, and use lambda functions in filtering.

Filtering Data Based on Values in a Column

To filter a Pandas DataFrame based on values in a specific column, we can use the following syntax:

```
df[df['column_name'] == value]
```

Here, we replace ‘column_name’ with the name of the column we wish to filter and ‘value’ with the value that we want to filter by. Here’s an example:

```
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96,
```
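A minimal sketch of the three filtering techniques named at the start of this section: filtering on a column value, chaining conditions, and filtering with a lambda function. It assumes the same sample data used throughout the article; the specific conditions and the names `age_22`, `high_22`, and `long_names` are illustrative.

```python
import pandas as pd

data = {"name": ["Alex", "Ben", "Chris", "Dave", "Ella", "Fiona", "George", "Hannah"],
        "age": [22, 19, 21, 22, 18, 23, 21, 22],
        "score": [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# Filter on a single column value
age_22 = df[df["age"] == 22]

# Chain conditions with & (and) or | (or); each needs its own parentheses
high_22 = df[(df["age"] == 22) & (df["score"] >= 85)]

# A lambda applied to each value builds the boolean mask
long_names = df[df["name"].apply(lambda n: len(n) > 4)]

print(age_22)
print(high_22)
print(long_names)
```

Comparisons against NaN evaluate to False, so rows with a missing score never pass a condition such as `df["score"] >= 85`.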
