Adventures in Machine Learning

Mastering Pandas DataFrame: Creation, Manipulation, and Cleaning

Python is an incredibly versatile programming language that can be used for a wide range of tasks. In particular, it has gained popularity in the data science community due to its extensive libraries for data manipulation and analysis.

One of the most powerful libraries for data manipulation is Pandas, which provides data structures and functions that allow users to easily handle and manipulate tabular data. One of the primary data structures in Pandas is the DataFrame, which is a two-dimensional table-like structure with columns and rows.

In this article, we’ll explore how to create DataFrames from various data structures, including Python lists, dictionaries, sets, tuples, and ndarrays.

Converting Python Lists to Pandas DataFrames

One of the most common ways to create a DataFrame is to convert a Python list. We can do this using the DataFrame constructor, which takes a list as the primary argument.

For example, the following code creates a DataFrame from a list of integers:

import pandas as pd
my_list = [1, 2, 3, 4, 5]
df = pd.DataFrame(my_list, columns=['Numbers'])
print(df)

Output:

   Numbers
0        1
1        2
2        3
3        4
4        5

As you can see, this creates a simple DataFrame with one column and five rows. The column name comes from the list passed as the “columns” argument; without it, pandas labels the column 0.

Customizing Column Names

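To customize the column names while creating a DataFrame from a list, we can pass a list of column names through the “columns” keyword argument of the DataFrame constructor. (The second positional argument of the constructor is “index”, not “columns”, so the keyword form is the safe one.) For example, the following code creates a DataFrame from a list of strings and customizes the column name: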

import pandas as pd
my_list = ['apple', 'banana', 'cherry']
df = pd.DataFrame(my_list, columns=['Fruit'])
print(df)

Output:

    Fruit
0   apple
1  banana
2  cherry

Customizing Row Index

We can also customize the row index of the DataFrame using the “index” parameter in the constructor. For example, the following code creates a DataFrame from a list of integers and sets the row index to start from 1 instead of 0:

import pandas as pd
my_list = [10, 20, 30, 40, 50]
df = pd.DataFrame(my_list, columns=['Numbers'], index=range(1, 6))
print(df)

Output:

   Numbers
1       10
2       20
3       30
4       40
5       50

Changing Data Types

While creating a DataFrame from a list, we may need to change the data type of the elements in the list. We can do this using the “dtype” parameter in the constructor.

For example, the following code creates a DataFrame from a list of integers but sets the data type to “float”:

import pandas as pd
my_list = [1, 2, 3, 4, 5]
df = pd.DataFrame(my_list, columns=['Numbers'], dtype=float)
print(df.dtypes)

Output:

Numbers    float64
dtype: object
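The “dtype” parameter only applies at construction time. As a minimal sketch of converting a DataFrame that already exists, the “astype” method can be used instead; the variable name df_float is illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5], columns=['Numbers'])

# astype returns a new DataFrame; the original df keeps its integer dtype
df_float = df.astype({'Numbers': 'float64'})

print(df_float.dtypes)
```

Passing a dictionary to “astype” makes it possible to convert only selected columns and leave the others as they are.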

Creating DataFrame from Hierarchical Lists as Rows

We may also have hierarchical (nested) lists where each sub-list represents one row of the DataFrame. We can create a DataFrame from such lists using “from_records”, which is a class method of DataFrame rather than part of the constructor.

For example, the following code creates a DataFrame from a list of sub-lists:

import pandas as pd
my_list = [[1, 'apple'], [2, 'banana'], [3, 'cherry']]
df = pd.DataFrame.from_records(my_list, columns=['Numbers', 'Fruit'])
print(df)

Output:

   Numbers   Fruit
0        1   apple
1        2  banana
2        3  cherry
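For a plain list of row lists, the regular DataFrame constructor should produce the same table as “from_records”. The sketch below builds both and compares them; the variable names are illustrative.

```python
import pandas as pd

rows = [[1, 'apple'], [2, 'banana'], [3, 'cherry']]

via_records = pd.DataFrame.from_records(rows, columns=['Numbers', 'Fruit'])
via_constructor = pd.DataFrame(rows, columns=['Numbers', 'Fruit'])

# Compare the cell values of the two results
print(via_records.values.tolist() == via_constructor.values.tolist())
print(via_constructor)
```

“from_records” is mostly useful when the input is a sequence of tuples, dictionaries, or a NumPy structured array; for simple nested lists either form works.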

Creating DataFrame from Hierarchical Lists as Columns

Alternatively, we may have hierarchical lists where each sub-list represents one column of the DataFrame. We can create a DataFrame from such hierarchical lists using the “zip” function.

For example, the following code creates a DataFrame from two sub-lists:

import pandas as pd
numbers = [1, 2, 3, 4, 5]
fruits = ['apple', 'banana', 'cherry', 'orange', 'pear']
df = pd.DataFrame(list(zip(numbers, fruits)), columns=['Numbers', 'Fruit'])
print(df)

Output:

   Numbers   Fruit
0        1   apple
1        2  banana
2        3  cherry
3        4  orange
4        5    pear

Creating DataFrame from Multiple Lists

We may also need to create a DataFrame from multiple lists, each representing a column in the DataFrame. We can do this by passing a dictionary to the DataFrame constructor, where the keys represent the column names and the values represent the lists.

Alternatively, we can use the “zip” function to create tuples from multiple lists and then pass the tuples to the constructor. For example, the following code creates a DataFrame from three lists using the dictionary method:

import pandas as pd
numbers = [1, 2, 3, 4, 5]
fruits = ['apple', 'banana', 'cherry', 'orange', 'pear']
quantities = [5, 4, 3, 2, 1]
df = pd.DataFrame({'Numbers': numbers, 'Fruit': fruits, 'Quantity': quantities})
print(df)

Output:

   Numbers   Fruit  Quantity
0        1   apple         5
1        2  banana         4
2        3  cherry         3
3        4  orange         2
4        5    pear         1
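The “zip” alternative mentioned above can be sketched for the same three lists as follows. The name df_zip is illustrative.

```python
import pandas as pd

numbers = [1, 2, 3, 4, 5]
fruits = ['apple', 'banana', 'cherry', 'orange', 'pear']
quantities = [5, 4, 3, 2, 1]

# zip pairs up the i-th element of each list, producing one tuple per row
rows = list(zip(numbers, fruits, quantities))
df_zip = pd.DataFrame(rows, columns=['Numbers', 'Fruit', 'Quantity'])
print(df_zip)
```

One difference worth knowing: “zip” silently stops at the shortest list, whereas the dictionary form raises a ValueError when the lists have different lengths. The dictionary form is therefore the safer choice when the lengths are not guaranteed to match.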

Conclusion

In summary, Pandas’ DataFrame is a powerful tool for data manipulation and analysis, and it can be created from a variety of data structures, such as lists, dictionaries, sets, tuples, and ndarrays. With this knowledge, you can easily import data from a variety of sources into Pandas’ DataFrame and perform your analysis with ease.

Adding Rows and Columns to DataFrame

The flexibility of Pandas’ DataFrame allows us to add rows and columns to an existing DataFrame with ease. In this section, we’ll discuss how to add rows and columns to a DataFrame.

Adding Rows to DataFrame

To add a new row to an existing DataFrame, we can use the “pd.concat” function. Older tutorials use the “append” method of DataFrame for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so calling it on a current version raises an AttributeError. “pd.concat” takes a list of DataFrames and stacks them, so we wrap the new row in a one-row DataFrame whose keys match the existing column names.

Passing “ignore_index=True” renumbers the rows from 0, so the new row receives the next integer index. For example, the following code adds a new row to an existing DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
# One-row DataFrame whose keys match the existing column names
new_row = pd.DataFrame([{'Name': 'Charlie', 'Age': 35}])
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat([df, new_row], ignore_index=True)
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

In the above code, we first create a DataFrame with two rows. Then, we create a one-row DataFrame with the values for the new row.

Finally, we concatenate the two using “pd.concat”.
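For a single row there is also a shorter route: assigning to a new label through “loc” enlarges the DataFrame. This is a minimal sketch that assumes the DataFrame still has its default 0, 1, 2, ... index, so that len(df) is the next unused label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With the default integer index, len(df) is the next unused label,
# so assigning to it adds one row (values are given in column order)
df.loc[len(df)] = ['Charlie', 35]
print(df)
```

If the index has been customized or rows have been dropped, len(df) may collide with an existing label and overwrite that row instead, so this shortcut is best kept for simple cases.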

Adding Columns to DataFrame

To add a new column to an existing DataFrame, we can use the “insert” method. This method takes the integer position where we want to add the new column, the name of the new column, and the values for the new column. It modifies the DataFrame in place and returns None, so the result is not assigned back to a variable.

For example, the following code adds a new column to an existing DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
new_col = [True, False]
df.insert(loc=2, column='IsMarried', value=new_col)
print(df)

Output:

    Name  Age  IsMarried
0  Alice   25       True
1    Bob   30      False

In the above code, we first create a DataFrame with two columns. Then, we create a new list for the values of the new column.

Finally, we add the new column to the DataFrame using the “insert” method.
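When the position of the new column does not matter, bracket assignment or the “assign” method are common alternatives to “insert”. The sketch below shows both; the column name AgeNextYear is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Bracket assignment adds the column at the end and modifies df in place
df['IsMarried'] = [True, False]

# assign returns a new DataFrame and leaves df untouched
df_extended = df.assign(AgeNextYear=df['Age'] + 1)

print(list(df.columns))           # ['Name', 'Age', 'IsMarried']
print(list(df_extended.columns))  # ['Name', 'Age', 'IsMarried', 'AgeNextYear']
```

Because “assign” returns a new object, it fits well into method chains, while bracket assignment is the shortest form for a one-off change.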

Adding Multiple Rows Using pd.concat()

To add multiple rows to a DataFrame at once, we can use the “pd.concat” function. The “append” method that older code uses for this was deprecated in pandas 1.4 and removed in pandas 2.0.

For example, the following code adds two new rows to an existing DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
new_data = [{'Name': 'Charlie', 'Age': 35}, {'Name': 'Dave', 'Age': 20}]
# Build a DataFrame from the list of dicts, then stack it under df
df = pd.concat([df, pd.DataFrame(new_data)], ignore_index=True)
print(df)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3     Dave   20

In the above code, we first create a DataFrame with two rows. Then, we create a list of new data where each item in the list represents a new row.

Finally, we convert the list to a DataFrame and concatenate it with the existing DataFrame using “pd.concat”.
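When rows arrive one at a time, growing the DataFrame inside a loop copies the data on every step. A common pattern, sketched below, is to collect the rows in a plain Python list and build the DataFrame once at the end. The incoming list is hypothetical sample data standing in for whatever source produces the rows.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Hypothetical incoming records, e.g. read one at a time from some source
incoming = [('Charlie', 35), ('Dave', 20), ('Eve', 41)]

# Collect plain dicts first, then concatenate a single time
collected = [{'Name': name, 'Age': age} for name, age in incoming]
df = pd.concat([df, pd.DataFrame(collected)], ignore_index=True)
print(df)
```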

Adding Columns Using DataFrame.insert()

We can also add columns to an existing DataFrame using the “insert” method.

Additionally, we can specify where we want to add the new column and what values we want to use to fill the new column. For example, the following code adds a new column at position 2 with values “Yes” and “No”:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
yes_no = ['Yes', 'No']
df.insert(loc=2, column='Married', value=yes_no)
print(df)

Output:

    Name  Age Married
0  Alice   25     Yes
1    Bob   30      No

In the above code, we first create a DataFrame with two columns. Then, we create a list of values for the new column.

Finally, we add the new column to the DataFrame at position 2 using the “insert” method.
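Two details of “insert” are easy to miss: the position can be anywhere from 0 to the current number of columns, and inserting a name that already exists raises a ValueError unless allow_duplicates=True is passed. The following sketch illustrates both; the ID column and its values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# loc=0 places the new column first; insert modifies df in place
df.insert(loc=0, column='ID', value=[101, 102])
print(list(df.columns))  # ['ID', 'Name', 'Age']

# Inserting a column name that already exists is refused by default
try:
    df.insert(loc=1, column='Age', value=[0, 0])
    duplicate_refused = False
except ValueError as exc:
    duplicate_refused = True
    print('insert refused:', exc)
```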

Accessing and Changing DataFrame Elements

Once we have created or modified a DataFrame, we may need to access and change the values of specific elements within the DataFrame. Here we describe some useful methods for accessing and changing DataFrame elements.

Accessing Elements Using loc[]

The “loc” indexer selects rows and columns by label. Note that it is used with square brackets rather than called like a method. For example, the following code accesses the value at row label 0 and column “Age” of a DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
value = df.loc[0, 'Age']
print(value)

Output:

25

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “loc” indexer to access the value at row label 0 and column “Age”.

Accessing Elements Using iloc[]

The “iloc” indexer selects rows and columns by integer position, counting from 0. For example, the following code accesses the value in the first row (position 0) and second column (position 1) of a DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
value = df.iloc[0, 1]
print(value)

Output:

25

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “iloc” indexer to access the value at row position 0 and column position 1.
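With the default index, row labels and row positions are the same numbers, so “loc” and “iloc” look interchangeable. The difference shows up once the index is customized. The sketch below uses made-up index labels 10, 20, 30 to make that visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = df.loc[10, 'Age']    # row whose label is 10
by_position = df.iloc[0, 1]     # first row, second column
print(by_label, by_position)    # 25 25

# Label 0 does not exist in this index, so loc raises KeyError
try:
    df.loc[0, 'Age']
    missing_label = False
except KeyError:
    missing_label = True

# Slices differ too: loc includes the end label, iloc excludes the end position
print(len(df.loc[10:20]))   # 2
print(len(df.iloc[0:1]))    # 1
```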

Accessing Elements Using at[]

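The “at” accessor reads a single value by label. It works like “loc” but is restricted to one cell, which makes it the intended choice for fast scalar access. For example, the following code accesses the value at row label 0 and column “Age” of a DataFrame: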

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
value = df.at[0, 'Age']
print(value)

Output:

25

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “at” accessor to read the value at row label 0 and column “Age”.

Accessing Elements Using iat[]

The “iat” accessor reads a single value by integer position, making it the scalar counterpart of “iloc”. For example, the following code accesses the value at row position 0 and column position 1 of a DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
value = df.iat[0, 1]
print(value)

Output:

25

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “iat” accessor to read the value at row position 0 and column position 1.

Changing DataFrame Values Using loc[]

The “loc” indexer can also be used to change the values of DataFrame elements by assigning to the selection. For example, the following code changes the value at row label 0 and column “Age” of a DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
df.loc[0, 'Age'] = 27
print(df)

Output:

    Name  Age
0  Alice   27
1    Bob   30

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “loc” indexer to change the value at row label 0 and column “Age” to 27.
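Assignment through “loc” is not limited to one cell. A boolean condition can take the place of the row label to update every matching row at once. The sketch below uses a third sample row and an arbitrary replacement value of 40 for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# The boolean mask picks the rows; the column label picks the target column
df.loc[df['Age'] >= 30, 'Age'] = 40
print(df)
```

Doing the row filter and the column selection in a single “loc” call matters here. Chaining two separate selections, such as df[df['Age'] >= 30]['Age'] = 40, may assign to a temporary copy and leave the original DataFrame unchanged.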

Changing DataFrame Values Using at[]

The “at” accessor can also be used to change the value of a single element in a DataFrame. For example, the following code changes the value at row label 0 and column “Age” of a DataFrame:

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
df.at[0, 'Age'] = 27
print(df)

Output:

    Name  Age
0  Alice   27
1    Bob   30

In the above code, we first create a DataFrame with two rows and two columns. Then, we use the “at” accessor to change the value at row label 0 and column “Age” to 27.

Conclusion

In this section, we discussed how to add rows and columns to an existing DataFrame. We also covered the indexers for accessing and changing DataFrame elements: “loc” and “iloc” for general selection, and “at” and “iat” for single values.

With these tools, you can create and manipulate complex data sets with ease and precision.

Data Cleaning in DataFrame

Data cleaning is crucial in data analysis, as data can often be incomplete or inaccurate. In Pandas, DataFrames can have missing or null values, and these need to be dealt with before further analysis.

In this section, we’ll discuss some methods for cleaning data in a Pandas DataFrame.

Dropping Null Values from DataFrame

One way to handle null values in a DataFrame is to drop them entirely. We can do this using the “dropna” method.

By default, this method returns a new DataFrame without the rows that contain at least one null value. For example, the following code drops null values from a DataFrame:

import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', np.nan], 'Age': [25, 30, np.nan]}
df = pd.DataFrame(data)
df = df.dropna()
print(df)

Output:

    Name   Age
0  Alice  25.0
1    Bob  30.0

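In the above code, we first create a DataFrame with two columns and three rows, one of which has null values. Then, we call “dropna”, which removes that row and leaves the two complete rows. The “Age” values print as 25.0 and 30.0 because the column contained NaN when it was created, so pandas stored it as float64.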
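Dropping every row with any null value can discard more data than intended. As a sketch of the finer controls, “dropna” accepts “how” and “subset” parameters, and “isna” can be used to count the nulls first. The sample data below is extended with extra rows for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'Dave'],
    'Age': [25, np.nan, np.nan, 40],
})

# Count the nulls in each column before deciding what to drop
print(df.isna().sum())

any_missing = df.dropna()                 # drop rows with at least one null (default)
all_missing = df.dropna(how='all')        # drop rows only if every value is null
name_known = df.dropna(subset=['Name'])   # only consider the Name column

print(len(any_missing), len(all_missing), len(name_known))  # 2 3 3
```

Each call returns a new DataFrame, so the original df is still available for comparison afterwards.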
