Adventures in Machine Learning

Mastering Pandas: Techniques for Working with Dataframes

1) Adding Entities to Dataframe Using For Loop

1.1 Appending Dataframe with Textual Values

Pandas provides an easy way to add textual and numerical data to a dataframe using a for loop. Let’s look at two different examples.

Suppose we have a dataframe with three columns: Name, Age, and Gender. We want to add a new row to the dataframe with the following values: “John”, 30, and “Male”.

We can use a list and row assignment with df.loc to achieve this. (Older tutorials use DataFrame.append(), but that method was deprecated in pandas 1.4 and removed in pandas 2.0.) Here’s how it works:

import pandas as pd
df = pd.DataFrame(columns=["Name", "Age", "Gender"])
new_row = ["John", 30, "Male"]
# With the default 0, 1, 2, ... index, len(df) is the next unused row label
df.loc[len(df)] = new_row

In this code, we first create an empty dataframe with the required columns. We then create a list with the values we want to add, and assign it to df.loc[len(df)], which creates a new row at the end of the dataframe.

Because the dataframe uses the default integer index, len(df) is always the next unused label, so the assignment adds a row rather than overwriting an existing one. We can use a loop to add multiple rows to the dataframe.

Suppose we have a list of people’s names, ages, and genders:

people = [("Mary", 25, "Female"), ("Tom", 40, "Male"), ("Jane", 35, "Female")]
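The loop itself is not shown here. As a minimal sketch of how it might look, the example below unpacks each tuple and assigns it to the next free row label. It uses df.loc[len(df)] rather than append() because DataFrame.append() was removed in pandas 2.0; the choice of df.loc is an assumption, not necessarily the approach the original loop took.

```python
import pandas as pd

people = [("Mary", 25, "Female"), ("Tom", 40, "Male"), ("Jane", 35, "Female")]
df = pd.DataFrame(columns=["Name", "Age", "Gender"])

for name, age, gender in people:
    # len(df) grows by one per iteration, so each tuple lands on a new row
    df.loc[len(df)] = [name, age, gender]

print(df)
```

This relies on the dataframe keeping its default integer index; with a custom index, len(df) might collide with an existing label.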

1.2 Appending Dataframe with Numerical Values

We can use a similar approach to add numerical data to a dataframe. Suppose we have an empty dataframe with five columns: A, B, C, D, and E.

We want to add 10 rows to the dataframe, with values for columns A, B, and C generated from a range. Here’s how we can do it:

df = pd.DataFrame(columns=["A", "B", "C", "D", "E"])
for i in range(10):
    # A, B, and C are derived from the loop counter; D and E are constants
    new_row = [i, i*2, i*3, "value1", "value2"]
    # DataFrame.append() was removed in pandas 2.0, so assign to the next row label
    df.loc[len(df)] = new_row
print(df)

In this code, we use the range() function to generate values for columns A, B, and C. We then create a new row with the desired values, and assign it to df.loc[len(df)] to add the row to the dataframe.

Since len(df) grows by one on every iteration, each new row is added with a new index value. The output will look like this:

   A   B   C       D       E
0  0   0   0  value1  value2
1  1   2   3  value1  value2
2  2   4   6  value1  value2
3  3   6   9  value1  value2
4  4   8  12  value1  value2
5  5  10  15  value1  value2
6  6  12  18  value1  value2
7  7  14  21  value1  value2
8  8  16  24  value1  value2
9  9  18  27  value1  value2
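Adding rows to a dataframe one at a time can get slow as the loop grows. A commonly used alternative is to collect the rows in a plain Python list and build the dataframe once at the end. The sketch below shows that pattern for the same five columns; the rows variable is introduced here purely for illustration.

```python
import pandas as pd

rows = []
for i in range(10):
    # Each dict is one row; keys become column names
    rows.append({"A": i, "B": i * 2, "C": i * 3, "D": "value1", "E": "value2"})

# Build the dataframe in a single step from the collected rows
df = pd.DataFrame(rows, columns=["A", "B", "C", "D", "E"])
print(df.shape)  # (10, 5)
```

The resulting dataframe holds the same values as the row-by-row version above.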

2) Constructing Input Dataframes

2.1 Constructing Input Dataframe with Textual Values

Suppose we want to construct a dataframe with the names of the Avengers. We can use a for loop and a list to achieve this:

avengers = ["Iron Man", "Captain America", "Thor", "Hulk", "Black Widow", "Hawkeye"]
df = pd.DataFrame(columns=["Name"])
for name in avengers:
    # DataFrame.append() was removed in pandas 2.0, so assign to the next row label
    df.loc[len(df)] = [name]
print(df)

In this code, we first create an empty dataframe with a single column for the Avengers’ names. We then use a for loop to add each name to the dataframe by assigning it to df.loc[len(df)].

Since len(df) grows by one on every iteration, each new row is added with a new index value. The output will look like this:

              Name
0         Iron Man
1  Captain America
2             Thor
3             Hulk
4      Black Widow
5          Hawkeye
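When the values are already in a list, the loop is optional. As a short sketch, the same single-column dataframe can be built directly by passing a dict that maps the column name to the list:

```python
import pandas as pd

avengers = ["Iron Man", "Captain America", "Thor", "Hulk", "Black Widow", "Hawkeye"]

# The dict key becomes the column name; the list supplies one value per row
df = pd.DataFrame({"Name": avengers})
print(len(df))  # 6
```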

2.2 Constructing Input Dataframe with Numerical Values

Suppose we want to construct a dataframe with randomly generated numbers for a given number of columns and rows. We can achieve this by generating one list of values per column and assigning it to the dataframe:

import pandas as pd
import random
num_rows = 5
num_cols = 3
df = pd.DataFrame(columns=["Column " + str(i+1) for i in range(num_cols)])
for i in range(num_cols):
    column_name = "Column " + str(i+1)
    values = [random.randrange(1, 101) for j in range(num_rows)]
    df[column_name] = values
print(df)

In this code, we first define the number of rows and columns we want in the dataframe. We then create an empty dataframe with column names generated using a list comprehension.

We use a for loop to iterate over the columns and generate random integers from 1 to 100 using the randrange() function. We then store each list of values in the dataframe by assigning it to df[column_name].

Because the values are random, your numbers will differ, but the output will look similar to this:

   Column 1  Column 2  Column 3
0        41        80        53
1        29        89        64
2         7        53        25
3        58        50        89
4        81        24         9
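Random values change on every run, which makes results hard to compare. One way to get repeatable data is to draw from a seeded generator. The sketch below assumes a helper named random_frame and a seed of 42, both chosen only for illustration.

```python
import random
import pandas as pd

def random_frame(num_rows, num_cols, seed):
    """Build a dataframe of random integers in [1, 100] from a fixed seed."""
    rng = random.Random(seed)  # private generator; the global one is untouched
    data = {
        f"Column {i + 1}": [rng.randrange(1, 101) for _ in range(num_rows)]
        for i in range(num_cols)
    }
    return pd.DataFrame(data)

first = random_frame(5, 3, seed=42)
second = random_frame(5, 3, seed=42)
print(first.equals(second))  # True: the same seed yields the same values
```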

3) Output Dataframes

When working with data, it is important to be able to output the contents of a dataframe. In Pandas, we can do this with ease by passing the dataframe to the print() function.

The same call works for both textual and numerical values, although we may want to format the two differently. Let’s take a look at examples of outputting dataframes with textual and numerical values:

3.1 Textual Values Dataframe Output

Suppose we have a dataframe with the following information on superheroes:

import pandas as pd
df = pd.DataFrame({'Name': ['Batman', 'Superman', 'Wonder Woman', 'Flash', 'Aquaman'],
                   'Power': ['Intelligence, strength', 'Strength, flight', 'Strength, agility', 'Speed, reflexes', 'Strength, water breathing'],
                   'Alter Ego': ['Bruce Wayne',  'Clark Kent', 'Diana Prince', 'Barry Allen', 'Arthur Curry']})
print(df)

This code will generate the following dataframe as output:

           Name                      Power     Alter Ego
0        Batman     Intelligence, strength   Bruce Wayne
1      Superman           Strength, flight    Clark Kent
2  Wonder Woman          Strength, agility  Diana Prince
3         Flash            Speed, reflexes   Barry Allen
4       Aquaman  Strength, water breathing  Arthur Curry

We can use the print() function to output this dataframe. Pandas lays the dataframe out as an aligned table, making it easy to read.
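If the row numbers on the left are not wanted, the table can be rendered to a string first. As a small sketch on a shortened version of the data, DataFrame.to_string() with index=False drops the index column:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Batman", "Superman"],
                   "Alter Ego": ["Bruce Wayne", "Clark Kent"]})

# to_string() returns the formatted table; index=False omits the row labels
text = df.to_string(index=False)
print(text)
```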

3.2 Numerical Values Dataframe Output

Suppose we have a dataframe with random values for a given number of columns and rows:

import pandas as pd
import random
num_rows = 5
num_cols = 3
data = {'Column ' + str(i+1): [random.randint(0,100) for j in range(num_rows)] for i in range(num_cols)}
df = pd.DataFrame(data)
print(df)

This code will generate output similar to the following (the values are random, so yours will differ):

   Column 1  Column 2  Column 3
0        46        25        65
1        78        11         5
2        63        81        70
3        11        67        52
4        67        76         7

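We can also use for loops to lay out the numerical values ourselves. For example, we could pad every cell to a fixed width so that the values line up under their column names:

# Header row: leave room for the index, then pad each column name to 10 characters
print("   ", end="")
for col in df.columns:
    print(f"{col:<10}", end="")
print()
# Data rows: the index first, then each value left-aligned in a 10-character slot
for index, row in df.iterrows():
    print(f"{index:<3}", end="")
    for col in df.columns:
        print(f"{row[col]:<10}", end="")
    print()

For the dataframe shown above, this code will output the values left-aligned under their headers:

   Column 1  Column 2  Column 3
0  46        25        65
1  78        11        5
2  63        81        70
3  11        67        52
4  67        76        7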

4) Conclusion

In this article, we have discussed a number of techniques for working with dataframes in Pandas. We started by discussing how to add entities to a dataframe using a for loop, and how to construct input dataframes with different types of values.

We then discussed how to output dataframes with both textual and numerical values, using different methods to format the output based on the data type. These techniques are essential for data analysts and data scientists who work with large datasets and want to create, modify, and manipulate dataframes with ease.

Overall, Pandas is a powerful Python library that offers many tools for working with data. By mastering these techniques, you will be able to tackle a wide range of scenarios involving dataframes.

To learn more about dataframes and Pandas, check out AskPython’s comprehensive tutorials and guides.

Remember, practice makes perfect, and Pandas is an excellent tool to master for anyone working with data.
