Adventures in Machine Learning

Counting Up: How to Find the Sum of Columns in Pandas DataFrame

Finding the Sum of Columns in Pandas DataFrame

One of a data scientist’s primary tasks is to analyze and interpret data. When working with a Pandas DataFrame, computing the sum of columns is a frequent requirement.

The sum of columns provides important insights into data trends. For instance, in a sales dataset, the sum of a particular column could provide the total revenue of the company.

Method 1: Find Sum of All Columns

The first method is to compute the sum of all columns in the DataFrame.

We can achieve this using the sum() method, which by default computes the sum of each column in the DataFrame. The sum() method has an optional parameter ‘axis’, which specifies the axis along which the sum is computed.

If axis is set to 0 (the default), the values are added down the rows, giving one sum per column; if axis is set to 1, the values are added across the columns, giving one sum per row. Passing axis=0 is therefore optional when computing the sum of all columns, but it makes the intent explicit.

Here’s the code that computes the sum of all columns in a DataFrame:

import pandas as pd
# create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# compute the sum of all columns
sum_all_cols = df.sum(axis=0)
print(sum_all_cols)

The output of the above code will be:

A     6
B    15
C    24
dtype: int64

Here, we have computed the sum of all columns using the sum() method with axis=0. We have stored the result in a new series named ‘sum_all_cols’.

The output shows the sum of each column in the DataFrame.
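
As a quick check on the axis parameter, the sketch below reuses the same sample data to compare a call with no arguments, a call with axis=0, and a call with axis=1. The variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# axis=0 is the default, so these two calls return the same column totals
col_totals_default = df.sum()
col_totals_explicit = df.sum(axis=0)
print(col_totals_default.equals(col_totals_explicit))

# axis=1 adds across the columns, giving one value per row
row_totals = df.sum(axis=1)
print(row_totals.tolist())  # [12, 15, 18]
```

The column totals keep the column names as their index, while the row totals keep the DataFrame’s row index.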

Method 2: Find Sum of Specific Columns

We may not always require the sum of all columns in a DataFrame.

In such cases, we can compute the sum of only the columns we need. To do so, we use the same sum() method after selecting those columns.

We select the columns by passing a list of their names. For instance, the following code snippet selects columns ‘A’ and ‘B’ and, with axis=1, adds them together row by row, storing the result in a new series named ‘sum_AB_cols’.

# compute the sum of specific columns
sum_AB_cols = df[['A', 'B']].sum(axis=1)
print(sum_AB_cols)

The output of the above code will be:

0     5
1     7
2     9
dtype: int64

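Here, we first selected the columns ‘A’ and ‘B’ and then called sum() with axis=1, so each value in the output is the sum of ‘A’ and ‘B’ for one row (for example, 1 + 4 = 5 for row 0). Omitting axis, or passing axis=0, would instead return one total per selected column. The result is stored in a new series named ‘sum_AB_cols’.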
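
If the goal is one total per selected column rather than one total per row, the same selection can be summed along the default axis. A minimal sketch, reusing the sample data from Method 1 (the variable names are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# one total per selected column (default axis=0)
col_totals_AB = df[['A', 'B']].sum()
print(col_totals_AB)

# a single column selected by name returns a single number
total_A = df['A'].sum()
print(total_A)  # 6

# summing the column totals gives one grand total for the selection
grand_total_AB = col_totals_AB.sum()
print(grand_total_AB)  # 21
```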

Example 1: Find Sum of All Columns

Let’s take an example to illustrate the process of finding the sum of all columns in a DataFrame.

Suppose we have a sales dataset with columns ‘Product Name’, ‘Sales’, and ‘Profit’. We want to find the total sales and total profit of the company.

Here’s the code that computes the sum of the numeric columns in the dataset:

import pandas as pd
# read the sales dataset
df = pd.read_csv('sales_data.csv')
# sum only the numeric columns; 'Product Name' holds text
sum_stats = df.sum(axis=0, numeric_only=True)
# sum_stats is a Series indexed by column name, one total per column
print(sum_stats)
# look up individual totals by column label
print(sum_stats['Sales'], sum_stats['Profit'])

In the above code, we first read the sales dataset using the read_csv() function. Then, we used the sum() method with axis=0 and numeric_only=True to calculate the sum of each numeric column. Without numeric_only, the text column ‘Product Name’ would also be included, which for strings typically means concatenating them.

The result, ‘sum_stats’, is a series indexed by column name, so it contains one entry for ‘Sales’ and one for ‘Profit’. We print it directly rather than adding it to the DataFrame as a new column: assigning it to df['sum_stats'] would align on the row index, which shares no labels with the column names, and fill the new column with NaN.
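
The snippet above depends on sales_data.csv, which is not included here. The sketch below uses a small made-up DataFrame with the same three column names (the product names and figures are assumptions for illustration only) to show what the column totals look like and what happens when they are assigned as a new column.

```python
import pandas as pd

# made-up stand-in for sales_data.csv; the values are purely illustrative
df = pd.DataFrame({'Product Name': ['Widget', 'Gadget', 'Gizmo'],
                   'Sales': [100, 250, 175],
                   'Profit': [20, 60, 35]})

# numeric_only=True skips the text column, so only Sales and Profit are summed
sum_stats = df.sum(axis=0, numeric_only=True)
print(sum_stats)

# the totals are indexed by column name, so look them up by label
total_sales = sum_stats['Sales']
total_profit = sum_stats['Profit']

# assigning the totals as a column aligns on the row index (0, 1, 2),
# which shares no labels with ('Sales', 'Profit'), so every value is NaN
check = df.copy()
check['sum_stats'] = sum_stats
print(check['sum_stats'].isna().all())
```

Because column totals and rows live on different indexes, it is usually clearer to keep the totals in their own series than to attach them to the DataFrame.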

Example 2: Find Sum of Specific Columns

In addition to finding the sum of all columns, we may need to compute the sum of specific columns in a DataFrame and use it for analysis.

For example, in a production dataset, we may want to compute the total production of a particular machine.

To find the sum of specific columns in a DataFrame, we can use the same sum() method described earlier.

This time, we select the required columns by passing a list of their names instead of using all columns. Setting the axis parameter to 1 then adds the selected columns together within each row.

Let’s take an example to illustrate the process of finding the sum of specific columns in a DataFrame. Suppose we have a manufacturing dataset with columns ‘Machine No’, ‘Production’, and ‘Rejects’.

Each machine appears in exactly one row, so its total production is the row-wise sum over the selected ‘Production’ column. Here’s the code that computes it:

import pandas as pd
# create a sample DataFrame
df = pd.DataFrame({'Machine No': [100, 101, 102, 103],
                   'Production': [200, 300, 150, 250],
                   'Rejects': [10, 20, 5, 15]})
# select the required columns
cols = ['Production']
# compute the sum of specific columns
sum_prod_cols = df[cols].sum(axis=1)
# add a new column containing the sum of specific columns
df['sum_stats'] = sum_prod_cols
print(df.head())

In this code snippet, we created a DataFrame with ‘Machine No’, ‘Production’, and ‘Rejects’ columns. Next, we selected only the ‘Production’ column using the cols variable.

We then used the sum() method with axis=1 and stored the result in sum_prod_cols to get the sum of the selected columns for each row. Because only one column is selected, each row’s sum simply equals its ‘Production’ value; adding more names to cols would include those columns in the row totals. We then added a new column to the DataFrame named ‘sum_stats’ and assigned the values stored in sum_prod_cols to it, which works because sum_prod_cols shares the DataFrame’s row index.

Finally, we printed the first few rows of the DataFrame to check the contents of the ‘sum_stats’ column.
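
For a single figure covering all machines, the selected column can be summed along the default axis instead. The sketch below reuses the sample data above. The second DataFrame, in which each machine appears in more than one row, is hypothetical and uses groupby() to get one total per machine.

```python
import pandas as pd

df = pd.DataFrame({'Machine No': [100, 101, 102, 103],
                   'Production': [200, 300, 150, 250],
                   'Rejects': [10, 20, 5, 15]})

# one overall total for a single column
total_production = df['Production'].sum()
print(total_production)  # 900

# one total per selected column
totals = df[['Production', 'Rejects']].sum()
print(totals)

# hypothetical log in which each machine reports more than once
log = pd.DataFrame({'Machine No': [100, 101, 100, 101],
                    'Production': [200, 300, 150, 250]})

# group the rows by machine, then sum Production within each group
per_machine = log.groupby('Machine No')['Production'].sum()
print(per_machine)
```

The groupby() result is a series indexed by ‘Machine No’, so a given machine’s total can be read with per_machine.loc[100].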

Conclusion

Finding the sum of columns in Pandas DataFrame is an essential aspect of data processing and analysis. In this article, we have explored two methods for computing the sum of columns in a DataFrame.

The first method calculates the sum of all columns, and the second method calculates the sum of specific columns chosen by name. Data scientists use the sum of columns in a DataFrame for a variety of reasons, such as identifying trends, calculating revenue generated by a particular product, or measuring the success of marketing campaigns.

We hope that this article has provided valuable insights into this crucial aspect of data analysis and processing.

Additional Resources

The Pandas DataFrame is one of the essential tools in data manipulation and analysis. There are several resources available to help you learn more about data manipulation, including complete documentation, tutorials, and user guides.

The official documentation for Pandas provides comprehensive information about the library, including use cases, data structures, and functions. The documentation is easy to navigate and includes examples to help users understand how to use the different features.

The Pandas website also has a section for tutorials and other learning resources. These tutorials cover a wide range of topics, from basic data manipulation tasks to advanced data analysis techniques.

In addition to the official documentation and tutorials, several online courses teach Pandas DataFrame operations and data analysis. Platforms like Udemy, Coursera, and DataCamp offer courses on Pandas and Python data analysis.
