Descriptive Statistics in Pandas: An Informative Overview
Descriptive statistics are a set of techniques for quantifying and summarizing the characteristics of a dataset, such as its center, spread, and most frequent values. They are usually the first step in data analysis and visualization, whatever the business or industry.
Pandas, a popular Python library for data manipulation and analysis, offers several functions and methods to compute descriptive statistics on DataFrames. In this article, we’ll explore the basic concepts and applications of descriptive statistics in Pandas, with a special focus on using the describe() method for categorical and numeric variables.
Using the describe() Method for Numeric Variables
Let’s start by discussing how the describe() method works for numeric variables. This method computes several descriptive statistics for a given dataframe, including count, mean, standard deviation, minimum and maximum values, and percentiles.
The result is returned as a new DataFrame, with each of these statistics shown as a row. Here’s an example to help clarify:
Suppose we have a dataset containing the ages of ten students, as follows:
import pandas as pd
data = {"age": [23, 21, 22, 24, 25, 23, 26, 23, 22, 24]}
df = pd.DataFrame(data)
To compute the basic descriptive statistics for these ages, we can simply apply the describe() method:
print(df.describe())
The output of the describe() method for this dataset would be as follows:
age
count 10.000000
mean 23.300000
std 1.494434
min 21.000000
25% 22.250000
50% 23.000000
75% 24.000000
max 26.000000
As you can see, the output provides several key values. count is the number of non-missing values, mean is the average age, std is the sample standard deviation, and min and max are the smallest and largest ages. The percentiles (25%, 50%, and 75%) are the values below which those percentages of observations fall; the 50% row is the median.
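Each row of the describe() output can be reproduced with a dedicated Series method, which is a handy way to confirm what the summary reports. The sketch below rebuilds the same age column and cross-checks a few rows. The variable names (ages, summary, custom) are illustrative, and the percentiles argument is shown only as one way to request cut points other than the defaults.

```python
import pandas as pd

# Same ages as in the example above.
df = pd.DataFrame({"age": [23, 21, 22, 24, 25, 23, 26, 23, 22, 24]})
ages = df["age"]

# describe() returns a DataFrame indexed by statistic name,
# so a single value can be read with .loc[row, column].
summary = df.describe()

# Cross-check a few rows against the individual Series methods.
# Series.std() computes the sample standard deviation (ddof=1) by default.
print(summary.loc["mean", "age"], ages.mean())
print(summary.loc["std", "age"], ages.std())
print(summary.loc["25%", "age"], ages.quantile(0.25))

# Ask for the 10th and 90th percentiles instead of the default quartiles.
custom = df.describe(percentiles=[0.1, 0.9])
print(custom.index.tolist())
```

Reading values with .loc is useful when a single statistic, such as the median, is needed in later code rather than in printed form.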
Using the describe() Method for Categorical Variables
The describe() method can also be used to summarize categorical variables. In this case, the method returns different types of statistics than it does for numeric variables.
The primary statistics computed are count, unique, top, and freq. Here’s an example to illustrate this:
Suppose we have a dataset containing a list of countries, as follows:
data = {"countries": ["China", "India", "USA", "Russia", "Mexico", "USA", "Mexico", "Mexico", "India"]}
df = pd.DataFrame(data)
To compute the basic descriptive statistics for these countries, we can apply the describe() method:
print(df.describe())
The output of the describe() method for this dataset would be as follows:
countries
count 9
unique 5
top Mexico
freq 3
Here, you can see that count is the number of values in the dataset, unique is the number of unique countries, top is the most frequent country, and freq is the frequency of that country.
Example 1: Descriptive Statistics for Categorical Variables
Let’s consider another example, where the DataFrame contains both a categorical and a numeric column and we want describe() to summarize only the categorical one.
Consider this dataset that contains the colors of shirts sold in a store and the number of shirts sold for each color, as follows:
data = {"color": ["white", "black", "green", "red", "blue"],
"num_shirts_sold": [50, 75, 20, 30, 45]}
df = pd.DataFrame(data)
To see the basic descriptive statistics for the categorical variable color, we can call the describe() method with the argument include='object', like this:
print(df.describe(include='object'))
The output of the describe() method for this dataset would be as follows:
color
count 5
unique 5
top black
freq 1
Here, count is the number of non-missing entries in the color column, unique is the number of distinct colors, top is the most frequently occurring color, and freq is how often that color occurs. Because every color appears exactly once, freq is 1 and all five colors are tied. The pandas documentation notes that top is chosen arbitrarily among tied values, so a different color may be reported when you run this code.
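Since top and freq are derived from how often each value occurs, it can help to look at those counts directly, especially when several values tie. The following sketch reuses the shirt data and calls describe() on the single color column. The variable names (counts, color_summary, all_tied) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["white", "black", "green", "red", "blue"],
    "num_shirts_sold": [50, 75, 20, 30, 45],
})

# Frequency of each distinct color; in this dataset every count is 1.
counts = df["color"].value_counts()
print(counts)

# A single column (a Series) also has a describe() method.
color_summary = df["color"].describe()
print(color_summary)

# If all counts are identical, every value is tied for "top",
# so the reported top value carries no information by itself.
all_tied = counts.nunique() == 1
print("All colors tied:", all_tied)
```

Checking value_counts() alongside describe() makes it clear whether top reflects a genuinely dominant category or just one of several tied values.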
Conclusion
Descriptive statistics are an essential tool for summarizing data efficiently, and Pandas provides several functions and methods to compute them on DataFrames.
For numeric variables, the basic descriptive statistics include count, mean, standard deviation, minimum and maximum values, and percentiles. For categorical variables, the primary statistics are count, unique, top, and freq.
These statistics help us understand the underlying structure of our data. A quick look at the describe() output before plotting or modeling makes it easier to choose sensible visualizations and to derive actionable insights.
Example 2: Descriptive Statistics for All Variables
In the previous examples, we have explored how to use the describe() method for numeric and categorical variables separately. However, in an actual dataset, it is common to have a mixture of both types of variables.
In such cases, we can use the describe() method to compute the summary statistics for all variables simultaneously. Here is an example dataset containing the ages and genders of ten individuals:
import pandas as pd
data = {"age": [23, 21, 22, 24, 25, 23, 26, 23, 22, 24],
"gender": ["M", "F", "M", "M", "F", "F", "F", "M", "M", "F"]}
df = pd.DataFrame(data)
By default, describe() summarizes only the numeric columns when a DataFrame contains any. To compute the summary statistics for all variables, pass include='all':
print(df.describe(include='all'))
The output is a single table in which statistics that do not apply to a column's type are shown as NaN:
              age gender
count   10.000000     10
unique        NaN      2
top           NaN      F
freq          NaN      5
mean    23.300000    NaN
std      1.494434    NaN
min     21.000000    NaN
25%     22.250000    NaN
50%     23.000000    NaN
75%     24.000000    NaN
max     26.000000    NaN
Here, you can see that the output shows the summary statistics for all variables in our dataset. The numeric variable age has the same statistics as in the numeric example at the start of this article, while the categorical variable gender shows the count, unique, top, and freq statistics explored in the categorical examples. Note that M and F each occur five times, so the value reported as top is simply one of the two tied values and may differ when you run the code.
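Which columns describe() summarizes is controlled by its include and exclude arguments. The sketch below uses the same age and gender data to compare the default call with include="all" and exclude="number". The variable names (numeric_only, everything, non_numeric) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [23, 21, 22, 24, 25, 23, 26, 23, 22, 24],
    "gender": ["M", "F", "M", "M", "F", "F", "F", "M", "M", "F"],
})

# Default call: only numeric columns are summarized when any are present.
numeric_only = df.describe()
print(numeric_only.columns.tolist())   # ['age']

# include="all": every column is summarized; statistics that do not
# apply to a column's type are filled with NaN.
everything = df.describe(include="all")
print(everything.columns.tolist())     # ['age', 'gender']

# exclude="number": drop the numeric columns and keep the rest.
non_numeric = df.describe(exclude="number")
print(non_numeric.columns.tolist())    # ['gender']
```

Selecting by dtype this way avoids listing column names by hand, which is convenient when a dataset has many columns of mixed types.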
Additional Resources
Pandas is a powerful and versatile library for data manipulation and analysis. In addition to the describe() method, there are numerous other functions and methods that can be used for common operations on data frames.
Here are a few examples of popular tutorials and resources on pandas and data analysis:
- Pandas Documentation: The official pandas documentation is an excellent resource for learning about the library’s functionality, including data structures, input/output, and data manipulation.
- Kaggle Tutorials: Kaggle is a popular platform for data science competitions, and it also offers numerous tutorials on pandas and other data analysis tools. Their tutorials cover a wide range of topics, from basic operations to advanced machine learning techniques.
- DataCamp: DataCamp provides interactive coding challenges and courses on several programming languages, including Python. Their courses on pandas cover basic and advanced topics, such as data manipulation, visualization, and time-series analysis.
In conclusion, descriptive statistics are a critical tool for summarizing datasets and understanding their underlying structure before further analysis and visualization. The describe() method in pandas is a quick way to compute summary statistics for numeric variables, for categorical variables, or for all columns together.
By taking advantage of the resources listed above, readers can further develop their data analysis skills and apply them in their professional or personal projects.