Adventures in Machine Learning

Mastering Frequency Counting in Pandas for Data Analysis

Counting Frequency of Unique Values in Pandas Series

Pandas is one of the most popular data manipulation libraries in Python. It provides easy-to-use tools for data analysis, including functions for counting the frequency of unique values in a pandas series.

In this article, we will explore how to use these functions to count the frequency of unique values, NaN values, relative frequency, frequency in equal-sized bins, and frequency of values in pandas dataframes.

Using value_counts() Function to Count Frequency

In pandas, the value_counts() function is used to count the frequency of unique values in a series. For instance, consider a pandas series with the following data:

import pandas as pd

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6])

To count the frequency of unique values in the series, we can use the value_counts() function as follows:

freq = data.value_counts()

print(freq)

Output:

4    2
3    2
5    2
6    2
2    2
7    1
dtype: int64

The output shows the frequency of unique values in descending order.

In this case, the values 4, 3, 5, 6, and 2 each occur twice, and the value 7 occurs once.
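By default, value_counts() sorts the result by count in descending order. If you prefer the result ordered by the values themselves, or by count in ascending order, a short sketch (using the same series as above) looks like this:

```python
import pandas as pd

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6])

# Sort the counts by the index (the unique values) instead of by frequency
freq_by_value = data.value_counts().sort_index()
print(freq_by_value)

# Sort by frequency in ascending order instead of the default descending
freq_ascending = data.value_counts(ascending=True)
print(freq_ascending)
```

Sorting by the index is often more readable when the values have a natural order, such as ages or ratings.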

Counting Frequency of NaN Values using Dropna Argument

NaN (Not a Number) values are used in pandas to represent missing data. The dropna argument of the value_counts() function controls whether NaN values appear in the result.

With dropna=True (the default), all NaN values are excluded before the unique values are counted; with dropna=False, NaN is counted as its own category. For instance, consider the following series with NaN values:

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6, None, None, None])

To count the frequency of non-NaN values, we can use the following code:

freq = data.value_counts(dropna=True)

print(freq)

Output:

4.0    2
3.0    2
5.0    2
6.0    2
2.0    2
7.0    1
dtype: int64
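To actually see how often NaN occurs, pass dropna=False so that NaN is included as its own entry in the counts. A sketch using the same series:

```python
import pandas as pd

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6, None, None, None])

# dropna=False keeps NaN as its own category in the counts
freq_with_nan = data.value_counts(dropna=False)
print(freq_with_nan)

# The NaN count can also be read directly, without value_counts()
print(data.isna().sum())
```

Here the result includes a NaN entry with a count of 3, and the counts sum to the full length of the series.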

Counting Relative Frequency using Normalize Argument

The normalize argument of the value_counts() function can be used to calculate the relative frequency of unique values in a pandas series. The normalize argument accepts a boolean value, where True means that the counts will be normalized to represent the relative frequency, and False means that the counts will represent the absolute frequency.

For instance, consider the following series:

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6])

To calculate the relative frequency of non-NaN values, we can use the following code:

freq = data.value_counts(normalize=True)

print(freq)

Output:

4    0.181818
3    0.181818
5    0.181818
6    0.181818
2    0.181818
7    0.090909
dtype: float64

The output shows the relative frequency of unique values in the series. In this case, each unique value occurs with a frequency of 0.181818, except for the value 7, which occurs with a frequency of 0.090909.
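The relative frequencies always sum to 1.0, so they can be converted to percentages by simple multiplication. A sketch, assuming you want percentage-style output rounded to two decimal places:

```python
import pandas as pd

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6])

# Relative frequencies sum to 1.0 across all unique values
rel = data.value_counts(normalize=True)
print(rel.sum())

# Multiply by 100 and round for percentage-style output
pct = (rel * 100).round(2)
print(pct)
```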

Counting Frequency in Equal-Sized Bins using Bins Argument

The value_counts() function can also count values that fall into bins rather than exact values. Passing an integer to its bins argument divides the range of the data into that many equal-width bins; alternatively, pd.cut() lets you specify the bin edges explicitly and then count the binned values with value_counts().

For instance, consider a pandas series with the following data:

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6, 8, 10, 12, 18, 25, 30])

To count the frequency of values in three equal-width bins with explicit edges, we can use the following code:

bins = [0, 10, 20, 30]

freq = pd.cut(data, bins=bins).value_counts()

print(freq)

Output:

(0, 10]     13
(10, 20]     2
(20, 30]     2
dtype: int64

The output shows the frequency of values in three bins. The first bin (0 to 10] contains 13 values, the second bin (10 to 20] contains 2 values, and the third bin (20 to 30] contains 2 values.
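The same result can be obtained more directly with the bins argument of value_counts() itself, which splits the range of the data (here 2 to 30) into the requested number of equal-width intervals. A minimal sketch:

```python
import pandas as pd

data = pd.Series([3, 4, 5, 2, 4, 2, 6, 7, 3, 5, 6, 8, 10, 12, 18, 25, 30])

# bins=3 divides the range of the data into three equal-width intervals
freq = data.value_counts(bins=3)
print(freq)
```

Note that the interval edges are computed from the minimum and maximum of the data, so they will generally not be round numbers the way explicit pd.cut() edges are.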

Counting Frequency of Values in Pandas DataFrames

Pandas dataframes are tabular data structures that contain multiple rows and columns. To count the frequency of values in a pandas dataframe, we need to specify the specific column we want to count.

For instance, consider the following dataframe:

data = pd.DataFrame({'name': ['John', 'Mary', 'Steve', 'John', 'Bob'],
                     'age': [32, 25, 19, 32, 40]})

To count the frequency of names in the dataframe, we can use the following code:

freq = data['name'].value_counts()

print(freq)

Output:

John     2
Bob      1
Mary     1
Steve    1
Name: name, dtype: int64

The output shows the frequency of names in the 'name' column of the dataframe. In this case, John occurs twice, and each of the other names occurs once.
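In pandas 1.1 and later, value_counts() can also be called on the dataframe itself to count unique combinations of values across all columns, which is equivalent to a groupby over those columns. A sketch using the same dataframe:

```python
import pandas as pd

data = pd.DataFrame({'name': ['John', 'Mary', 'Steve', 'John', 'Bob'],
                     'age': [32, 25, 19, 32, 40]})

# Count unique (name, age) row combinations (DataFrame.value_counts, pandas >= 1.1)
row_freq = data.value_counts()
print(row_freq)

# Equivalent result with groupby() and size()
grouped = data.groupby(['name', 'age']).size()
print(grouped)
```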

Additional Resources

Apart from the functions explained in this article, pandas offers many other common functions that can be useful for data analysis. You can find more information on these functions by referring to the pandas documentation or exploring pandas tutorials online.

Some of the commonly used functions include groupby(), merge(), pivot_table(), and resample(). These functions perform grouping and aggregation operations on data, merging data from multiple sources, reshaping and pivoting data, and resampling time series data, respectively.
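As a brief taste of these, groupby() combined with size() produces per-group frequencies much like value_counts() does. A minimal sketch with a made-up 'city' column:

```python
import pandas as pd

# Hypothetical dataframe used only for illustration
df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'LA', 'NY']})

# size() counts the rows in each group, giving a frequency table
counts = df.groupby('city').size()
print(counts)
```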

Conclusion

In this article, we explored how to use the value_counts() function in pandas to count the frequency of unique values, handle NaN values with the dropna argument, compute relative frequencies with the normalize argument, count values in equal-width bins with the bins argument, and count values in dataframe columns.

Mastering these functions gives you quick insight into the distribution of data in your pandas series and dataframes. Pandas offers many other functions for data analysis, and exploring them further will sharpen your data manipulation skills in Python.
