Summing Columns and Rows in Pandas DataFrame
If you’re working with large amounts of data, you might need to perform various calculations to analyze it effectively. There are times when you’ll need to group or summarize this data to generate insights and understand the trends.
One of the most common operations is to sum the columns and rows in a Pandas DataFrame. The process can be incredibly simple and straightforward when you understand the syntax and working with Pandas DataFrame.
1) Syntax for Summing Each Column
A DataFrame is a 2D table that stores data in a tabular form. When you want to sum each column in a Pandas DataFrame, you can use the built-in sum() function to calculate the sum across the column axis.
In Pandas DataFrame, the syntax for summing each column is df.sum(axis=0)
, where ‘df’ refers to the DataFrame object, and ‘axis=0’ specifies that the sum operation should be performed along the column axis.
2) Syntax for Summing Each Row
Likewise, if you want to sum each row in a Pandas DataFrame, the syntax is df.sum(axis=1)
. Here, ‘axis=1’ specifies that the sum operation should be performed along the row axis.
3) Steps to Sum Each Column and Row in Pandas DataFrame
Now that you know the syntax for performing column and row sums, let’s explore the steps required to sum each column and row in a Pandas DataFrame in more detail.
Data Preparation
To illustrate this process, let’s take an example of a commission data table, which contains the commission generated by sales reps in a company. The data table might contain columns such as Rep ID, Sales Figures, Commission Percentage, and so on.
Creating a DataFrame
Once you have your data ready to go, the next step is to create a DataFrame object. You can import the Pandas library using the command import pandas as pd
, and then pass your data as an argument to the pd.DataFrame()
function to construct a new DataFrame object. For example:
import pandas as pd
data = {'Rep ID': [1, 2, 3, 4],
'Sales Figures': [10000, 15000, 20000, 18000],
'Commission Percentage': [0.05, 0.10, 0.07, 0.05]}
df = pd.DataFrame(data)
If you were to print the DataFrame df
, you would see the following output:
Rep ID Sales Figures Commission Percentage
0 1 10000 0.05
1 2 15000 0.10
2 3 20000 0.07
3 4 18000 0.05
Summing Each Column and Row in Pandas DataFrame
When you have set up your DataFrame, it’s time to start calculating the sums. If we want to calculate the sum of each column, we can invoke the df.sum(axis=0)
function like this:
column_sums = df.sum(axis=0)
This command will return a new DataFrame object, which contains the sum of each column.
If we use the print()
function to show the column_sums
variable, it will provide the following output:
Rep ID 10.00
Sales Figures 73000.00
Commission Percentage 0.27
dtype: float64
You can see that we have the sum for each column, as well as the datatype. In this example, the Rep ID column was summed and returned as a float, which isn’t very meaningful.
However, the other columns are intuitive and can be reasoned about. Similarly, if you want to calculate the sum of each row, you can use the df.sum(axis=1)
function like this:
row_sums = df.sum(axis=1)
Once again, the row_sums
variable will provide a new DataFrame object containing the sum of each row.
When we print the row_sums
variable, we will get the following output:
0 1005.0
1 1650.0
2 1407.0
3 1110.0
dtype: float64
Here, we have a sum for each row, represented as a float value. You might notice that the index is a sequence of integers starting from zero, which is the default behavior for Pandas DataFrame.
You can change this index, but it’s beyond the scope of this article.
4) Count for each Column and Row in Pandas DataFrame
When working with large datasets, it can be helpful to identify the number of valid (not null) values in each column and row. In Pandas DataFrame, the count()
function can be used to do this.
Syntax for Getting Count of Values for Each Column
To get the count of values for each column in a Pandas DataFrame, we first invoke the df.count()
function on the DataFrame itself with the axis=0
parameter, as shown below:
df_count = df.count(axis=0)
print(df_count)
This code block will display the number of non-null values for each column in the df
DataFrame:
Month 4
Product Type 4
Units Sold 4
Total Sale 4
Row Sum 4
dtype: int64
In this output, we can see that all the columns have four non-null values because that’s the number of rows in our DataFrame.
Syntax for Getting Count of Values for Each Row
To get the count of values for each row, we first invoke the df.count()
function on the DataFrame with the axis=1
parameter, as shown in the following code:
df_count = df.count(axis=1)
print(df_count)
Running this code will display the number of non-null values for each row in the df
DataFrame:
0 4
1 4
2 4
3 4
dtype: int64
In this output, we can see that all the rows have four non-null values because that’s the number of columns in our DataFrame. Counting the null cells in Pandas DataFrame becomes essential when you need to identify missing information or the quality of your data set.
You can also use the isnull()
function in Pandas to identify null values in a DataFrame.
5) Summarizing Count for each Column and Row in Pandas DataFrame
One of the frequent needs in data analysis is to understand the completeness of a dataset. In other words, we need to know how many valid values we have in each column and row of a Pandas DataFrame.
Fortunately, Pandas provides the count()
function that allows us to count the non-null values.
Creating a DataFrame
Let’s take an example of a test score table containing records of students’ marks. The columns of the table are student id, name, subject, and test score.
import pandas as pd
data = {'Student ID': [1, 2, 3, 4, 5],
'Name': ['John', 'Sara', 'Tom', 'Micheal', 'Sarah'],
'Subject': ['Math', 'Science', 'English', 'Math', 'Math'],
'Test Score': [85, 78, 89, 92, 88]}
df = pd.DataFrame(data)
print(df)
When you run this code, you’ll see the following DataFrame:
Student ID Name Subject Test Score
0 1 John Math 85
1 2 Sara Science 78
2 3 Tom English 89
3 4 Micheal Math 92
4 5 Sarah Math 88
Getting Count of Values for Each Column
To obtain the count of values for each column of the table, we use the count()
function for the DataFrame object with axis=0
parameter.
column_count = df.count(axis=0)
print(column_count)
When you run this code, it will show the following output for column count:
Student ID 5
Name 5
Subject 5
Test Score 5
dtype: int64
Here, the count for each column is equal to the total number of records (5) in the DataFrame. This output shows that we don’t have any missing values in the DataFrame.
Getting Count of Values for Each Row
To get the count of values for each row in the DataFrame, we use count()
function with axis=1
parameter.
row_count = df.count(axis=1)
print(row_count)
When you run this code, it will show the following output for row count:
0 4
1 4
2 4
3 4
4 4
dtype: int64
Here, the count for each row is equal to the total number of columns (4) in the DataFrame. This output shows that we don’t have any missing values in the DataFrame.
6) Other Operations on Pandas DataFrame
Apart from counting and summing rows and columns in a Pandas DataFrame, there are numerous other operations you can perform to better understand your dataset.
Average Each Column
You can use the mean()
function in Pandas to calculate the average of each column in a Pandas DataFrame. Consider the previous example, where we had a table of test scores.
Let’s calculate the average test score for each subject.
average_score = df.groupby('Subject').mean()['Test Score']
print(average_score)
When you run this code, it will show the following output for average score:
Subject
English 89.0
Math 88.3
Science 78.0
Name: Test Score, dtype: float64
Here, the average test score for English is 89, for Math is 88.3, and for Science, it’s 78.
Pandas Documentation
Pandas offer a rich set of functions and methods to process data efficiently, making it a popular tool for data scientists and developers. You can refer to the official Pandas documentation for a detailed list of functions and methods available.
The Pandas documentation provides examples, definitions, and usage of every function, allowing you to utilize the full potential of this powerful tool.
In conclusion, Pandas is an essential tool for data analysis and manipulation.
In this article, we discussed how to summarize counts for each column and row in Pandas DataFrame using the count()
function with axis=0
and axis=1
parameters, respectively. We also explored the mean()
function, which you can use to calculate the average of each column in a Pandas DataFrame.
Although we only scratched the surface of the operations you can do with Pandas, it is often sufficient to provide basic insights and analyses. In conclusion, this article has covered the essential operations that you need to perform on Pandas DataFrame while analyzing datasets.
We have learned how to sum each column and row of a DataFrame using the sum()
function with axis=0
and axis=1
parameters, respectively. We have also seen how to summarize counts for each column and row in Pandas DataFrame using the count()
function with axis=0
and axis=1
parameters, respectively.
Additionally, we explored how to calculate the average of each column in a Pandas DataFrame using the mean()
function. These operations provide a better understanding of data, which is crucial for making informed decisions in various fields like finance, marketing, healthcare, and more.
By utilizing these Pandas operations, you can improve data analysis efficiency and