Adventures in Machine Learning

Mastering Pandas: Essential Techniques for Data Analysis and Manipulation

Are you a data analyst or an aspiring data analyst looking to work with data using the powerful Python library, Pandas? If so, you’re in the right place.

In this article, we’ll explore two essential topics in Pandas that will make data analysis and manipulation a breeze – finding unique values and creating a Pandas DataFrame.

Finding Unique Values in Pandas and Ignoring NaN Values

When working with large datasets, it’s often necessary to identify unique values to gain insights into the data. However, missing or NaN values can often pose a challenge that requires specialized solutions.

In this section, we’ll explore two examples of how to find unique values in Pandas while ignoring NaN values.

Custom Function for Finding Unique Values

Pandas provides a unique() method that returns the unique values in a Pandas series or column. However, this method does not ignore NaN values.

Suppose we have a Pandas dataframe as shown in the following example:

import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

We can see that the ‘score’ column has missing values: the None entries in the dictionary become NaN when the DataFrame is created (and the column is promoted to float). To find unique values in the ‘score’ column while ignoring NaN values, we can use a custom function as follows:

def unique_vals_ignore_na(col):
    # unique() keeps NaN in its result, so filter out missing values explicitly
    return [v for v in col.unique() if pd.notna(v)]

This function first finds the unique values in the column using the unique() method, then keeps only the entries for which pd.notna() returns True. Checking every value, rather than just the first one, matters because unique() preserves the order of first appearance, so a NaN can land anywhere in the result. Finally, the function returns the resulting list of unique values.

Example 1: Finding Unique Values in Pandas Column and Ignoring NaN Values

To find unique values in the ‘score’ column of the dataframe while ignoring NaN values using the custom function, we can do the following:

unique_scores = unique_vals_ignore_na(df['score'])

print(unique_scores)

This should output: `[85.0, 78.0, 92.0, 90.0, 75.0, 96.0]`
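Alternatively, Pandas’ built-in methods can do the same job without a helper function: dropna() removes the missing values before unique() runs. Here is a minimal sketch using the same example data:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# drop the NaN values first, then take the unique values
unique_scores = df['score'].dropna().unique().tolist()
print(unique_scores)  # [85.0, 78.0, 92.0, 90.0, 75.0, 96.0]
```

Both approaches give the same list; the built-in chain is usually preferable in practice because it avoids maintaining a custom function.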

Example 2: Finding Unique Values in Pandas Groupby and Ignoring NaN Values

To find unique values in a pandas groupby object while ignoring NaN values using the custom function, we can use the following code snippet:

grouped = df.groupby('age')['score'].apply(unique_vals_ignore_na)

print(grouped)

This should output:

age
18          [75.0]
19          [78.0]
21          [92.0]
22    [85.0, 90.0]
23          [96.0]
Name: score, dtype: object

Creating a Pandas DataFrame

Creating a Pandas DataFrame is one of the first steps in working with Pandas for data analysis. We create DataFrames as a way of representing and manipulating data in tabular form.

In this section, we’ll explore how to create a Pandas DataFrame and view its contents.

Importing Required Libraries

Before creating a DataFrame, we need to import the required libraries: Pandas, and optionally NumPy for numerical work. Here’s a code snippet that shows how to import these libraries:

import pandas as pd
import numpy as np

Creating DataFrame

To create a Pandas DataFrame, we can use various methods. One of the easiest ways is to use a dictionary to define our data.

Here’s an example:

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

In this example, we define a dictionary data and use the pd.DataFrame() function to create a Pandas DataFrame from the dictionary. The resulting DataFrame will have columns as specified in the dictionary’s keys and values as specified in the dictionary’s values.
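Since NumPy was imported above, it is worth noting that a DataFrame can also be built from a 2-D NumPy array by supplying the column names separately. A small sketch (the array values here are illustrative):

```python
import numpy as np
import pandas as pd

# build a DataFrame from a 2-D NumPy array, naming the columns explicitly
arr = np.array([[22, 85], [19, 78], [21, 92]])
df_from_array = pd.DataFrame(arr, columns=['age', 'score'])
print(df_from_array)
```

Without the columns argument, Pandas would fall back to integer column labels 0 and 1.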

Viewing DataFrame

To view the contents of a DataFrame, we can use the head() and tail() methods. The head() method returns the first n rows of the DataFrame (five by default), while the tail() method returns the last n rows.

print(df.head())

This should output:

     name  age  score
0    Alex   22   85.0
1     Ben   19   78.0
2   Chris   21   92.0
3    Dave   22   90.0
4    Ella   18   75.0
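The tail() method works the same way from the other end of the DataFrame; passing an integer controls how many rows come back. For instance, asking for the last 3 rows of the same DataFrame:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# last 3 rows of the DataFrame
print(df.tail(3))
```

This returns the rows for Fiona, George, and Hannah, including the NaN scores.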

We’ve covered two essential topics in Pandas that will make your journey into data analysis much smoother. By learning how to find unique values while ignoring NaN values and how to create a Pandas DataFrame, you’re well on your way to manipulating and analyzing data with ease.

With practice, these concepts will become second nature to you, enabling you to unlock insights and make data-driven decisions.

3) Displaying Basic Information about a Pandas DataFrame

When working with a Pandas DataFrame, it’s often useful to display basic information about the data to obtain an understanding of its structure and content. In this section, we’ll explore how to display column names, data types, and shape of the DataFrame.

Displaying Column Names

To display the column names of a Pandas DataFrame, we can use the columns attribute. Here’s an example:

import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
print(df.columns)

This should output: `Index(['name', 'age', 'score'], dtype='object')`

The columns attribute returns a Pandas Index object that contains the column names.

Displaying Data Types

To display the data types of each column in a Pandas DataFrame, we can use the dtypes attribute. Here’s an example:

print(df.dtypes)

This should output:

name      object
age        int64
score    float64
dtype: object

The dtypes attribute returns a Pandas series that contains the data types of each column.

Displaying Shape of DataFrame

To display the shape or dimensions of a Pandas DataFrame, we can use the shape attribute. Here’s an example:

print(df.shape)

This should output: `(8, 3)`

The shape attribute returns a tuple that contains the number of rows and columns in the DataFrame.
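Beyond these individual attributes, the info() method prints the column names, non-null counts, and dtypes in a single call, which is a convenient first look at a new dataset:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# one-call summary: column names, non-null counts, and dtypes
df.info()
```

For this DataFrame, info() reports 8 entries and shows that ‘score’ has only 6 non-null values, flagging the missing data immediately.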

4) Accessing Data in a Pandas DataFrame

Accessing data in a Pandas DataFrame is a critical part of data analysis. In this section, we’ll explore how to access rows and columns using indexing, iloc, and loc.

Accessing Rows and Columns Using Indexing

To access a specific row or column in a Pandas DataFrame using indexing, we can use the iloc attribute. Here’s an example:

import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
# accessing a specific column by index
print(df.iloc[:, 1])

This should output:

0    22
1    19
2    21
3    22
4    18
5    23
6    21
7    22
Name: age, dtype: int64

The iloc attribute allows us to access data using the numerical index of the row or column. In this example, we access the second column (index 1) using the syntax df.iloc[:, 1], which means accessing all rows and the second column.

Similarly, we can access a specific row using the index of its position as shown below:

# accessing a specific row by index
print(df.iloc[2, :])

This should output:

name     Chris
age         21
score     92.0
Name: 2, dtype: object

Accessing Data Using iloc

To access specific rows and columns of a Pandas DataFrame using iloc, we can use the following syntax:

df.iloc[row_start:row_end, col_start:col_end] 

This syntax allows us to specify a range of rows and columns to access. Here’s an example:

# accessing a range of rows and columns using iloc
print(df.iloc[2:5, 0:2])

This should output:

     name  age
2   Chris   21
3    Dave   22
4    Ella   18

In this example, we accessed the rows at positions 2 through 4 and the columns at positions 0 and 1. The end of an iloc slice is exclusive, so 2:5 stops after the row at position 4 and 0:2 stops after the column at position 1.

Accessing Data Using loc

The loc attribute allows us to access data in a Pandas DataFrame using labels or names instead of numerical indexes. Here’s an example:

# accessing a specific row using loc
print(df.loc[3])

This should output:

name     Dave
age        22
score    90.0
Name: 3, dtype: object

In this example, we accessed the row labeled 3. We can also access a specific column using loc as shown below:

# accessing a specific column using loc
print(df.loc[:, 'score'])

This should output:

0    85.0
1    78.0
2    92.0
3    90.0
4    75.0
5    96.0
6     NaN
7     NaN
Name: score, dtype: float64

In this example, we accessed the ‘score’ column using the loc attribute.

In conclusion, accessing and displaying basic information in a Pandas DataFrame is essential in data analysis. Understanding how to access data using indexing, iloc, and loc, as well as displaying column names, data types, and shape, will enable you to manipulate and analyze data with ease. With practice, these concepts will become second nature, allowing you to unlock insights and make data-driven decisions.
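One more loc feature worth noting before moving on: loc also accepts a boolean mask alongside column labels, combining row filtering and column selection in a single lookup. A sketch using the same example data:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# rows where age is over 21, restricted to the 'name' and 'score' columns
subset = df.loc[df['age'] > 21, ['name', 'score']]
print(subset)
```

This returns the name and score for Alex, Dave, Fiona, and Hannah, the four people older than 21.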

5) Modifying Data in a Pandas DataFrame

Modifying data in a Pandas DataFrame is an essential part of data analysis. In this section, we’ll explore how to modify data in specific columns, add a new column to a DataFrame, and drop rows with NaN values.

Modifying Data in Specific Columns

To modify data in specific columns of a Pandas DataFrame, we can use the following syntax:

df['column_name'] = new_values

Here, we replace ‘column_name’ with the name of the column we wish to modify and ‘new_values’ with the new values that we want to assign to the column. Here’s an example:

import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)
# modifying data in the 'score' column
df['score'] = [85, 78, 92, 90, 75, 96, 87, 81]

print(df)

This should output:

     name  age  score
0    Alex   22     85
1     Ben   19     78
2   Chris   21     92
3    Dave   22     90
4    Ella   18     75
5   Fiona   23     96
6  George   21     87
7  Hannah   22     81

In this example, we modified the data in the ‘score’ column by assigning new values to the column.
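To modify only some rows of a column rather than replacing it wholesale, loc with a boolean mask is the idiomatic route. Here is a sketch that adds 5 bonus points (an illustrative adjustment, not from the original example) only to scores below 80:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)

# add 5 bonus points, but only where the score is below 80
df.loc[df['score'] < 80, 'score'] = df['score'] + 5
print(df)
```

Only Ben (78 to 83) and Ella (75 to 80) change; every other score is left untouched. Assigning through df.loc also avoids the chained-assignment pitfalls that can arise from writing into a filtered copy.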

Adding a New Column to a DataFrame

Adding a new column to a Pandas DataFrame is easy. We can use the following syntax:

df['new_column_name'] = new_column_data

Here, we replace ‘new_column_name’ with the name of the column to be added and ‘new_column_data’ with the values that we want to assign to the new column.

Here’s an example:

# adding a new column to the DataFrame
df['grade'] = ['A', 'C', 'A', 'A', 'C', 'A', 'B', 'B']

print(df)

This should output:

     name  age  score grade
0    Alex   22     85     A
1     Ben   19     78     C
2   Chris   21     92     A
3    Dave   22     90     A
4    Ella   18     75     C
5   Fiona   23     96     A
6  George   21     87     B
7  Hannah   22     81     B

In this example, we added a new column ‘grade’ to the DataFrame and assigned the appropriate values.

Dropping Rows with NaN Values

Sometimes, we may wish to drop rows with NaN values from a Pandas DataFrame. To do this, we can use the following syntax:

df.dropna(axis=0, inplace=True)

Here, the dropna() method drops all rows that contain at least one NaN value.

The axis=0 argument specifies that we want to drop rows (axis=1 would drop columns instead), and inplace=True modifies the DataFrame directly rather than returning a new one.
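A small sketch of dropna() in action on the example data. Note that without inplace=True, dropna() returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, None, None]}
df = pd.DataFrame(data)

# drop the rows whose 'score' is NaN (George and Hannah)
cleaned = df.dropna(axis=0)
print(cleaned.shape)  # (6, 3)
```

The original df keeps all 8 rows, while cleaned contains only the 6 rows with complete data.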

6) Filtering Data in a Pandas DataFrame

Filtering data in a Pandas DataFrame is essential in data analysis. In this section, we’ll explore how to filter data based on values in a specific column, chain multiple filters together, and use lambda functions in filtering.

Filtering Data Based on Values in a Column

To filter a Pandas DataFrame based on values in a specific column, we can use the following syntax:

df[df['column_name'] == value]

Here, we replace ‘column_name’ with the name of the column we wish to filter and ‘value’ with the value that we want to filter by. Here’s an example:

import pandas as pd
data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
          'age': [22, 19, 21, 22, 18, 23, 21, 22],
          'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)
# filtering data in the 'score' column based on a value
filtered_df = df[df['score'] >= 90]

print(filtered_df)

In this example, we kept the rows whose ‘score’ is greater than or equal to 90 and assigned the result to a new DataFrame called filtered_df. We can then print filtered_df to view the results.

Chaining Multiple Filters

To chain multiple filters together, we can use the following syntax:

df[(df['column_name1'] == value1) & (df['column_name2'] == value2)]

Here, we replace ‘column_name1’ and ‘column_name2’ with the names of the columns we wish to filter and ‘value1’ and ‘value2’ with the values that we want to filter by. The parentheses around each comparison are required because the & operator binds more tightly than ==. Here’s an example:

# chaining multiple filters together
filtered_df = df[(df['age'] == 22) & (df['score'] >= 90)]

print(filtered_df)

In this example, we chained two filters together: one keeps rows where the ‘age’ column equals 22, and the other keeps rows where the ‘score’ column is at least 90. Only rows satisfying both conditions appear in the result.

Using Lambda Functions in Filtering

Lambda functions can also be used in filtering. Here’s an example:

# using lambda functions in filtering
filtered_df = df[df['age'].apply(lambda x: x > 20)]

print(filtered_df)

In this example, we used a lambda function to filter the ‘age’ column based on the condition that the age is greater than 20. Lambda functions can make filtering more concise when the condition is too complex for a simple comparison, although for a condition like this one a vectorized expression such as df['age'] > 20 is usually faster than apply().
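For readable multi-condition filters, the query() method is another option: it expresses the filter as a string and produces the same result as the equivalent vectorized comparison. A quick sketch:

```python
import pandas as pd

data = {'name': ['Alex', 'Ben', 'Chris', 'Dave', 'Ella', 'Fiona', 'George', 'Hannah'],
        'age': [22, 19, 21, 22, 18, 23, 21, 22],
        'score': [85, 78, 92, 90, 75, 96, 87, 81]}
df = pd.DataFrame(data)

# equivalent filters: a vectorized comparison and the query() method
vectorized = df[df['age'] > 20]
queried = df.query('age > 20')
print(queried)
```

Both return the six rows where age exceeds 20; which to use is largely a matter of style, with query() often reading more naturally for compound conditions like 'age > 20 and score >= 85'.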

These are just a few examples of how to filter data in a Pandas DataFrame. The possibilities are endless, and with practice, you’ll be able to filter data with ease and gain valuable insights from your datasets.
