Adventures in Machine Learning

Effortlessly Analyze Your Data with pandas’ describe() Function

Descriptive Statistics Using describe() Function in Pandas

Have you ever wondered how to quickly calculate and view descriptive statistics for your dataset? The describe() function in pandas makes this straightforward. In this article, we will look at its syntax and default output, then show how to keep only the metrics you need, such as the mean and standard deviation.

Syntax and Default Output

To get started, let’s take a look at the basic syntax for using the describe() function in pandas:

dataframe.describe()

This calculates basic summary statistics for each numeric column in the DataFrame: count, mean, standard deviation, minimum, the 25%, 50%, and 75% percentiles, and maximum. The default output will look something like this:

          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000

This output provides a quick overview of the dataset’s numeric columns, including the number of non-missing values (count), the average value (mean), and the spread of the values (std, the sample standard deviation).
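To make the meaning of count and "numeric" concrete, here is a minimal sketch. The `people` DataFrame and its values are hypothetical and exist only for illustration; it contains one missing height and one text column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing height and one non-numeric column
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "height": [64.0, np.nan, 68.0, 70.0],
    "weight": [130, 140, 150, 160],
})

summary = people.describe()

# Only the numeric columns are summarized by default, so "name" is skipped
print(list(summary.columns))

# count is the number of non-missing values, so height has one fewer than weight
print(summary.loc["count"])
```

The text column is left out of the summary, and the missing height lowers that column's count without raising an error.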

Specifying Metrics to Calculate

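Sometimes, we may only be interested in specific metrics for our dataset, such as the mean and standard deviation. describe() has no parameter for choosing which statistics it computes (its optional parameters percentiles, include, and exclude control which percentiles are reported and which column types are summarized). However, because it returns a DataFrame indexed by statistic name, we can select just the rows we want with .loc.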

For example, if we only want to view the mean and standard deviation for the numeric variables, we can use the syntax:

dataframe.describe().loc[['mean', 'std']]

This will return a modified output that includes only the mean and standard deviation:

         height      weight        age
mean  68.000000  150.000000  35.400000
std    3.162278   15.811388  12.177849

This technique can be useful if we want to quickly compare the variability of two different datasets, or if we only need specific summary statistics for our calculations.

Example: Use describe() in Pandas to Only Calculate Mean and Std

Now, let’s see an example of how to use the describe() function in pandas to display only the mean and standard deviation for a sample DataFrame.

Creating a Sample DataFrame

First, let’s create a sample DataFrame with three numeric variables: height, weight, and age. We can use the pd.DataFrame constructor to create this dataset from a dictionary of columns:

import pandas as pd

data = {'height': [64, 66, 68, 70, 72],
        'weight': [130, 140, 150, 160, 170],
        'age': [22, 25, 35, 45, 50]}

df = pd.DataFrame(data)

print(df)

This will create a DataFrame with five rows and three columns:

   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50

Viewing the Sample DataFrame

Now that we have our sample DataFrame, let’s take a look at it using the .head() method, which returns the first five rows by default:

print(df.head())

This will output:

   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50

Calculating Descriptive Statistics for Each Numeric Variable

To calculate the summary statistics for each numeric variable in our DataFrame, we can simply use the describe() function:

print(df.describe())

This will display the following output:

          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000
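The std row is the sample standard deviation, meaning pandas divides by n - 1 rather than n by default. As a quick sanity check, the sketch below recomputes the mean and standard deviation of the height column by hand. The variable names are illustrative only.

```python
import math
import pandas as pd

height = pd.Series([64, 66, 68, 70, 72])

# Mean: 340 / 5
mean = height.sum() / len(height)

# Sum of squared deviations from the mean: 16 + 4 + 0 + 4 + 16 = 40
squared_dev = ((height - mean) ** 2).sum()

# Sample standard deviation divides by n - 1: sqrt(40 / 4)
sample_std = math.sqrt(squared_dev / (len(height) - 1))

print(mean, round(sample_std, 6))

# pandas' std() uses ddof=1 by default, so the two values agree
print(math.isclose(sample_std, height.std()))
```

Running the same arithmetic on the weight or age columns is a simple way to confirm any figure that describe() reports.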

Using Syntax to Only Calculate Mean and Standard Deviation

If we only want to calculate the mean and standard deviation for the numeric variables in our DataFrame, we can use the syntax:

print(df.describe().loc[['mean', 'std']])

This will display the following output:

         height      weight        age
mean  68.000000  150.000000  35.400000
std    3.162278   15.811388  12.177849

Output with Mean and Standard Deviation Only

As you can see, the syntax we used only returned the mean and standard deviation for each variable, which can be a quick and easy way to get the information we need.
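Note that describe() still computes every statistic before .loc filters the rows. If the goal is to compute only the mean and standard deviation in the first place, one alternative (not the approach used in this article) is DataFrame.agg. A minimal sketch using the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [64, 66, 68, 70, 72],
    "weight": [130, 140, 150, 160, 170],
    "age": [22, 25, 35, 45, 50],
})

# Compute only the two statistics instead of filtering a full summary
only_mean_std = df.agg(["mean", "std"])
print(only_mean_std)

# The values match the rows selected from describe()
from_describe = df.describe().loc[["mean", "std"]]
print(np.allclose(only_mean_std.to_numpy(), from_describe.to_numpy()))
```

Both approaches give the same numbers here; agg simply skips the statistics that would be discarded anyway.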

Conclusion

The describe() function in pandas is a quick way to calculate and view summary statistics for a dataset. By default, it summarizes each numeric column in a DataFrame with the count, mean, standard deviation, minimum, quartiles, and maximum.

When only some of those metrics are needed, selecting rows from the result with .loc (for example, the mean and std rows) keeps the output focused. This can save time and effort when exploring large datasets and makes the statistical analysis process more streamlined.