Descriptive Statistics Using describe() Function in Pandas
Have you ever wondered how to quickly calculate and view descriptive statistics for your dataset? The describe() function in pandas does this in a single call. In this article, we will explore the syntax and default output of describe() and show how to keep only selected metrics, such as the mean and standard deviation.
1. Syntax and Default Output
To get started, let’s take a look at the basic syntax for using the describe() function in pandas:
dataframe.describe()
This will calculate basic summary statistics for each numeric variable in the DataFrame: the count, mean, standard deviation, minimum, the 25th, 50th, and 75th percentiles, and the maximum. The default output will look something like this:
          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000
This output provides a quick overview of the dataset’s numeric variables, including the number of non-missing values (count), the average value (mean), and the spread of the values (std, the sample standard deviation).
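To make the meaning of count and std concrete, here is a minimal sketch. The DataFrame named sample and its values are hypothetical and not part of this article's dataset; one value is deliberately missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data (not the article's dataset): one height is missing.
sample = pd.DataFrame({"height": [64.0, 66.0, np.nan, 70.0, 72.0]})

summary = sample.describe()

# count only includes non-missing values, so the NaN is not counted.
count = summary.loc["count", "height"]

# std is the sample standard deviation (it divides by n - 1, i.e. ddof=1).
std_from_describe = summary.loc["std", "height"]
std_sample = sample["height"].std(ddof=1)

print(count)
print(std_from_describe, std_sample)
```

The two standard deviation values printed on the last line should agree, since describe() and Series.std() both default to the sample formula.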
2. Specifying Metrics to Calculate
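Sometimes, we may only be interested in specific metrics for our dataset, such as the mean and standard deviation. The describe() function has no parameter for choosing these rows (its optional parameters, percentiles, include, and exclude, control which percentiles are reported and which column types are summarized). However, because describe() returns a DataFrame indexed by statistic name, we can select the rows we want with .loc.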
For example, if we only want to view the mean and standard deviation for the numeric variables, we can use the syntax:
dataframe.describe().loc[['mean', 'std']]
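For comparison, the sketch below contrasts the percentiles parameter of describe() with row selection through .loc. The DataFrame named example and its score column are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical example column, used only for illustration.
example = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

# percentiles changes which percentile rows describe() reports.
custom = example.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())

# Selecting rows by label is what narrows the result to mean and std.
mean_std = example.describe().loc[["mean", "std"]]
print(mean_std.index.tolist())
```

The first call changes the percentile rows but still reports count, mean, std, min, and max; only the .loc selection reduces the result to the two rows of interest.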
2.1. Example: Use describe() in Pandas to Only Calculate Mean and Std
Now, let’s see an example of how to use the describe() function in pandas to calculate only the mean and standard deviation for a sample DataFrame.
3. Creating a Sample DataFrame
First, let’s create a sample DataFrame with three numeric variables: height, weight, and age. We can use the pandas DataFrame function to create this dataset:
import pandas as pd

data = {
    'height': [64, 66, 68, 70, 72],
    'weight': [130, 140, 150, 160, 170],
    'age': [22, 25, 35, 45, 50],
}
df = pd.DataFrame(data)
print(df)
This will create a DataFrame with five rows and three columns:
   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50
4. Viewing the Sample DataFrame
Now that we have our sample DataFrame, let’s take a look at it using the .head() method, which returns the first five rows by default:
print(df.head())
This will output:
   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50
5. Calculating Descriptive Statistics for Each Numeric Variable
To calculate the summary statistics for each numeric variable in our DataFrame, we can simply use the describe() function:
print(df.describe())
This will display the following output:
          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000
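As a sanity check on this kind of output, the mean and standard deviation of one column can be recomputed by hand and compared with what describe() reports. The following is a minimal sketch that rebuilds the sample DataFrame and uses the sample formula (dividing by n - 1); the helper variable names are illustrative only.

```python
import math

import pandas as pd

df = pd.DataFrame({
    "height": [64, 66, 68, 70, 72],
    "weight": [130, 140, 150, 160, 170],
    "age": [22, 25, 35, 45, 50],
})

stats = df.describe()

# Recompute the height column's mean and sample standard deviation by hand.
heights = df["height"].tolist()
n = len(heights)
manual_mean = sum(heights) / n
manual_std = math.sqrt(sum((h - manual_mean) ** 2 for h in heights) / (n - 1))

# Each line prints the hand-computed value next to the describe() value.
print(manual_mean, stats.loc["mean", "height"])
print(manual_std, stats.loc["std", "height"])
```

If the two numbers on a line differ, the displayed table and the data are out of sync, which is worth checking before the statistics are reported anywhere.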
6. Selecting Only the Mean and Standard Deviation
If we only want to view the mean and standard deviation for the numeric variables in our DataFrame, we can select those two rows from the describe() output:
print(df.describe().loc[['mean', 'std']])
This will display the following output:
         height      weight        age
mean  68.000000  150.000000  35.400000
std    3.162278   15.811388  12.177849
7. Output with Mean and Standard Deviation Only
As you can see, selecting rows with .loc returns only the mean and standard deviation for each variable, which can be a quick and easy way to get the information we need. Note that describe() still computes every statistic; .loc simply filters what is shown.
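If computing only these two statistics matters, one alternative that this article does not otherwise cover is DataFrame.agg, which accepts a list of aggregation names. The sketch below rebuilds the sample DataFrame and compares both approaches; treat it as an illustration rather than the recommended method.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [64, 66, 68, 70, 72],
    "weight": [130, 140, 150, 160, 170],
    "age": [22, 25, 35, 45, 50],
})

# Approach used in this article: compute everything, then select two rows.
via_describe = df.describe().loc[["mean", "std"]]

# Alternative (an addition, not from the article): compute only mean and std.
via_agg = df.agg(["mean", "std"])

print(via_describe)
print(via_agg)
```

Both results are DataFrames with the rows mean and std and one column per variable, so either can be used in the same way downstream.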
8. Conclusion
The describe() function in pandas is a quick way to calculate and view summary statistics for a dataset. By default, describe() reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, and 75%), and maximum for each numeric variable in a DataFrame.
When only some of these metrics are needed, we can select them from the result by row label, for example with .loc[['mean', 'std']]. This keeps the output focused on the values of interest, which saves time when summarizing datasets with many variables.
With the importance of data analysis increasing in industries across the board, being comfortable with functions like describe() helps produce accurate results efficiently.