Adventures in Machine Learning

Effortlessly Analyze Your Data with pandas’ describe() Function

Descriptive Statistics Using describe() Function in Pandas

Have you ever wondered how to quickly calculate and view descriptive statistics for your dataset? Using the describe() function in pandas can make this task a breeze! In this article, we will explore the syntax and default output of describe() and discuss how to specify the metrics to calculate, such as mean and standard deviation.

1. Syntax and Default Output

To get started, let’s take a look at the basic syntax for using the describe() function in pandas:

dataframe.describe()

This will calculate and display basic summary statistics for each numeric variable in the dataframe: the count, mean, standard deviation, minimum, 25th, 50th, and 75th percentiles, and maximum. For the sample data used later in this article, the default output looks like this:

          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000

This output provides a quick overview of the dataset’s numeric variables, including the number of non-missing values (count), the average value (mean), and the variability of the variable (standard deviation).
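
As a minimal sketch of those two behaviors, the snippet below uses a small hypothetical DataFrame (the name demo, its columns, and its values are made up for illustration). It shows that a text column is left out of the default summary and that count ignores missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one text column and one missing numeric value
demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'score': [10.0, 20.0, np.nan, 30.0],
    'level': [1, 2, 3, 4],
})

summary = demo.describe()

# Only the numeric columns appear in the default summary
print(list(summary.columns))

# count reports non-missing values, so 'score' has 3 and 'level' has 4
print(summary.loc['count'])
```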

2. Specifying Metrics to Calculate

Sometimes, we may only be interested in specific metrics for our dataset, such as the mean and standard deviation. describe() does not take a parameter that limits the output to these statistics, but it returns a DataFrame whose rows are labeled by statistic name, so we can select just the rows we want with .loc.

For example, if we only want to view the mean and standard deviation for the numeric variables, we can use the syntax:

dataframe.describe().loc[['mean', 'std']]
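
This row selection works because the result of describe() is itself a DataFrame. The sketch below, which uses a hypothetical one-column DataFrame named demo purely for illustration, prints the row labels that are available to .loc:

```python
import pandas as pd

# Hypothetical single-column DataFrame, used only for illustration
demo = pd.DataFrame({'x': [1, 2, 3, 4]})

summary = demo.describe()

# The result is a DataFrame whose row labels are the statistic names
print(type(summary).__name__)
print(list(summary.index))

# .loc selects rows by those labels
subset = summary.loc[['mean', 'std']]
print(subset)
```

Any of the labels in that index, such as 'min' or '50%', can be passed to .loc in the same way.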

2.1. Example: Use describe() in Pandas to Only Calculate Mean and Std

Now, let’s see an example of how to use the describe() function in pandas to calculate only the mean and standard deviation for a sample DataFrame.

3. Creating a Sample DataFrame

First, let’s create a sample DataFrame with three numeric variables: height, weight, and age. We can use the pandas DataFrame function to create this dataset:

import pandas as pd
data = {'height': [64, 66, 68, 70, 72],
        'weight': [130, 140, 150, 160, 170],
        'age': [22, 25, 35, 45, 50]
       }
df = pd.DataFrame(data)
print(df)

This will create a DataFrame with five rows and three columns:

   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50

4. Viewing the Sample DataFrame

Now that we have our sample DataFrame, let’s take a look at it using the .head() method, which returns the first five rows by default:

print(df.head())

This will output:

   height  weight  age
0      64     130   22
1      66     140   25
2      68     150   35
3      70     160   45
4      72     170   50

5. Calculating Descriptive Statistics for Each Numeric Variable

To calculate the summary statistics for each numeric variable in our DataFrame, we can simply use the describe() function:

print(df.describe())

This will display the following output:

          height      weight        age
count   5.000000    5.000000   5.000000
mean   68.000000  150.000000  35.400000
std     3.162278   15.811388  12.177849
min    64.000000  130.000000  22.000000
25%    66.000000  140.000000  25.000000
50%    68.000000  150.000000  35.000000
75%    70.000000  160.000000  45.000000
max    72.000000  170.000000  50.000000
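
The std row is the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n. As a rough check, the sketch below recomputes it by hand for the height column and compares the result with what describe() reports (the variable names are illustrative):

```python
import math
import pandas as pd

# The height column from the sample data
heights = pd.Series([64, 66, 68, 70, 72])

n = len(heights)
mean = heights.sum() / n

# Sample standard deviation: divide the squared deviations by n - 1, not n
sample_std = math.sqrt(((heights - mean) ** 2).sum() / (n - 1))

print(mean)
print(round(sample_std, 6))
print(round(heights.describe()['std'], 6))
```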

6. Using Syntax to Only Calculate Mean and Standard Deviation

If we only want to view the mean and standard deviation for the numeric variables in our DataFrame, we can select those two rows from the describe() output:

print(df.describe().loc[['mean', 'std']])

This will display the following output:

         height      weight        age
mean  68.000000  150.000000  35.400000
std    3.162278   15.811388  12.177849

7. Output with Mean and Standard Deviation Only

As you can see, the syntax we used only returned the mean and standard deviation for each variable, which can be a quick and easy way to get the information we need.
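
Note that describe() still computes every statistic before .loc discards the unwanted rows. If only these two statistics are needed, one alternative (not part of the walkthrough above, shown here as a sketch) is to call DataFrame.agg on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'height': [64, 66, 68, 70, 72],
                   'weight': [130, 140, 150, 160, 170],
                   'age': [22, 25, 35, 45, 50]})

# Compute only the mean and standard deviation of each column
only_two = df.agg(['mean', 'std'])
print(only_two)

# Compare with selecting the same rows from the full describe() output
from_describe = df.describe().loc[['mean', 'std']]
matches = bool(((only_two - from_describe).abs() < 1e-9).all().all())
print(matches)
```

Both approaches return a DataFrame with one row per statistic, so the choice mostly comes down to whether the other summary rows are useful elsewhere in the analysis.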

8. Conclusion

The describe() function in pandas is a quick way to calculate and view summary statistics for your dataset. By default, it reports the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric variable in a DataFrame.

describe() does not take a parameter that limits the output to statistics such as the mean and standard deviation, but because its result is an ordinary DataFrame, we can keep just the rows we need with .loc, as in describe().loc[['mean', 'std']]. This keeps the output focused when a dataset has many columns.

With data analysis growing in importance across industries, getting comfortable with functions like describe() is a practical step toward faster and more reliable exploratory work.
