Adventures in Machine Learning

Visualizing Data: Building and Plotting Histograms in Python

Do you ever wonder how to effectively represent data so that it becomes easier to interpret and understand? Have you ever observed how data is distributed?

Histograms can help us achieve both of these objectives. In this article, we will explore the concept of histograms, their importance, and how we can make and visualize them using Python libraries.

Histograms: Definition and Importance

Histograms are used to understand the distribution of a set of continuous data by dividing the data into intervals, or bins. Each bin covers a specific range of data values, and the height of its bar represents the frequency of values in that range.

They allow us to identify patterns, trends, and outliers in our data. Histograms can be used to represent data for a wide range of applications such as finance, healthcare, and marketing analytics.

Different Options for Building and Plotting Histograms in Python Libraries

Python libraries like NumPy, Matplotlib, Pandas, and Seaborn provide different approaches to creating histograms. NumPy provides core functionality for numerical calculations, and Matplotlib provides core functionality for creating graphs.

Pandas is an excellent library for data manipulation and analysis, whereas Seaborn is focused on the visualization of statistical data. Below are some examples of how we can use these libraries to build histograms.

Using NumPy:

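import numpy as np
import matplotlib.pyplot as plt
data = [1.2, 1.5, 1.6, 2.1, 2.6, 2.7, 2.7, 3.1, 3.2, 3.3, 3.3, 3.5, 3.9, 4.0]
# NumPy computes the normalized bin heights and the bin edges
heights, edges = np.histogram(data, bins=5, density=True)
# Matplotlib only draws the bars that NumPy computed
plt.bar(edges[:-1], heights, width=np.diff(edges), align='edge', alpha=0.5, color='blue')
plt.show()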

Using Matplotlib:

import matplotlib.pyplot as plt
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
plt.show()

Using Pandas:

import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('data.csv')
data.plot.hist(bins=10, alpha=0.5)
plt.show()

Using Seaborn:

import seaborn as sns
import matplotlib.pyplot as plt
data = sns.load_dataset('mpg')
sns.histplot(data['mpg'], kde=True)
plt.show()

Notice how in each example above, we have provided the data and additional parameters to customize the visualization. We can adjust the number of bins, color, transparency, and a lot more to achieve the most useful visualization for our specific data.
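The effect of the bin count can be seen without drawing anything. The following is a minimal sketch that reuses the data from the NumPy example above; the variable names are illustrative, and np.histogram_bin_edges() is used only to show that NumPy can suggest bin edges from a named rule.

```python
import numpy as np

data = [1.2, 1.5, 1.6, 2.1, 2.6, 2.7, 2.7, 3.1, 3.2, 3.3, 3.3, 3.5, 3.9, 4.0]

# The same data summarized with a coarse and a finer binning
counts_3, edges_3 = np.histogram(data, bins=3)
counts_5, edges_5 = np.histogram(data, bins=5)
print(counts_3, counts_5)

# NumPy can also suggest edges from a named rule such as 'sturges' or 'auto'
suggested_edges = np.histogram_bin_edges(data, bins='sturges')
print(len(suggested_edges) - 1, "bins suggested")
```

Fewer bins smooth over detail, while more bins show finer structure but leave fewer values in each bar. Either way, the counts always add up to the number of data points.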

Histograms in Pure Python

While we can use libraries like NumPy, Matplotlib, Pandas, and Seaborn to create histograms in Python, we can also create them using pure Python. Let’s explore some examples below.

Creating Frequency Tables using Python Dictionaries

The process of making a histogram involves counting the number of data points falling within the specified range. One way to count this is by using dictionaries, where the key is the bin, and value is the count.

def count_frequency(arr, step):
    # Map each value to a bin index and count how many values share it
    freq_dict = {}
    for value in arr:
        idx = int(value / step)  # bin index for a bin width of `step`
        if idx in freq_dict:
            freq_dict[idx] += 1
        else:
            freq_dict[idx] = 1
    return freq_dict

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
freq_table = count_frequency(data, 1)

print(freq_table)

Output:

{1: 1, 2: 2, 3: 3, 4: 2, 5: 1}
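With a step other than 1, the same idea groups continuous values into ranges. The following is a minimal sketch rather than a replacement for the function above: the name count_frequency_binned and the sample values are hypothetical, and it uses math.floor() so that negative values fall into the bin below them instead of being truncated toward zero.

```python
import math

def count_frequency_binned(values, step):
    """Count how many values fall into each bin of width `step`.

    Bin k covers the half-open interval [k * step, (k + 1) * step).
    """
    freq = {}
    for x in values:
        k = math.floor(x / step)       # bin index; floor also handles negatives
        freq[k] = freq.get(k, 0) + 1   # start from 0 the first time a bin is seen
    return freq

# Hypothetical sample values, chosen only for illustration
samples = [-0.4, 0.2, 0.7, 1.1, 1.9, 2.5, 2.6]
table = count_frequency_binned(samples, 1.0)
for k in sorted(table):
    print(f"[{k * 1.0:.1f}, {(k + 1) * 1.0:.1f}): {table[k]}")
```

Using dict.get() with a default of 0 replaces the if/else branch, which is a common way to shorten this kind of counting loop.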

Using Counter() to Create Frequency Tables and Comparing it with Handmade Function

Python’s standard-library collections module provides a Counter class that counts how often each element occurs in the data. Let’s compare it with the handmade function we created above. With a step of 1 and integer data, both should produce the same counts.

from collections import Counter
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counter = Counter(data)

print(counter)
freq_table = count_frequency(data, 1)

print(freq_table)

Output:

Counter({3: 3, 2: 2, 4: 2, 1: 1, 5: 1})

{1: 1, 2: 2, 3: 3, 4: 2, 5: 1}
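Counter can also absorb the binning step if it is fed bin indices instead of raw values. The sketch below assumes a bin width of 2, which is an illustrative choice and not a value used elsewhere in the article.

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
step = 2  # illustrative bin width

# Feed Counter the bin index of each value instead of the value itself
binned = Counter(x // step for x in data)

# Counter is a dict subclass, so it compares equal to a plain dict with the same counts
print(binned == {0: 1, 1: 5, 2: 3})

# most_common() lists bins from the highest count to the lowest
print(binned.most_common(2))
```

Here bin 0 holds the value 1, bin 1 holds the values 2 and 3, and bin 2 holds the values 4 and 5.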

Creating An ASCII Histogram using Output Formatting in Python

We can represent the frequency table we created above in the form of an ASCII histogram. It involves displaying asterisks, and each asterisk represents a count in the frequency table.

def ascii_histogram(freq_table, step):
    # Print one row per bin, labelled with the integer range the bin covers
    for i in sorted(freq_table):
        low = i * step
        print('{0:2d} - {1:<2d} : {2}'.format(low, low + step - 1, '*' * freq_table[i]))

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
freq_table = count_frequency(data, 1)

ascii_histogram(freq_table, 1)

Output:

 1 - 1  : *
 2 - 2  : **
 3 - 3  : ***
 4 - 4  : **
 5 - 5  : *
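One asterisk per count becomes unwieldy once the counts grow large. A possible variation, sketched here with a hypothetical function name and made-up counts, scales every bar relative to the largest count so that the widest bar fits a fixed width.

```python
def ascii_histogram_scaled(freq_table, max_width=40):
    """Return one text line per bin, scaling bars so the tallest is `max_width` wide."""
    peak = max(freq_table.values())
    lines = []
    for key in sorted(freq_table):
        count = freq_table[key]
        # Round to the nearest whole character, but keep non-empty bins visible
        width = max(1, round(count * max_width / peak)) if count else 0
        lines.append('{0:>4} : {1} ({2})'.format(key, '*' * width, count))
    return lines

# Hypothetical counts that would be too wide to print one asterisk per item
counts = {1: 120, 2: 480, 3: 960, 4: 30}
for line in ascii_histogram_scaled(counts, max_width=32):
    print(line)
```

Printing the raw count in parentheses keeps the exact numbers available even though the bars are only proportional. The sketch assumes a non-empty frequency table.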

Conclusion

In this article, we discussed histograms, their definition, importance, and different options for building and plotting them in Python libraries. We also explored ways to create histograms using pure Python with examples like creating frequency tables using dictionaries, using collections.Counter() to create frequency tables, and creating an ASCII histogram.

Visualizing data using histograms is crucial regardless of the industry or field of study. We hope that this article helps you start visualizing your data better by building histograms in Python.

NumPy and Histograms

Histograms are a key tool in data analysis, and Python’s NumPy library offers several functions for computing them. In this section, we will explore NumPy’s histogram() function in detail, including how to use it to create a true histogram by binning the data, and how to construct a frequency table by utilizing np.bincount().

The NumPy library contains several built-in functions for working with histograms, including the histogram function.

This function takes an array of data values and returns two arrays: the counts of values in each bin, and the bin edges, which has one more element than the counts. The histogram function in NumPy has several parameters, including bins, range, density, and weights.

The bins parameter specifies either the number of equal-width bins or a sequence of bin edges, and the range parameter sets the lower and upper limits of the values to include in the histogram. The density parameter can be set to True to normalize the histogram so that its total area is 1, and the weights parameter lets you supply an array of weights, so that each data value adds its weight, rather than 1, to its bin.
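The four parameters can be tried on a small array. This is a minimal sketch with illustrative data and weights, not values taken from the article.

```python
import numpy as np

values = np.array([0.5, 1.5, 1.7, 2.2, 2.9, 3.1, 3.8])  # illustrative data

# An integer `bins` with `range` gives equal-width bins over that range
counts, edges = np.histogram(values, bins=4, range=(0, 4))
print(counts, edges)  # 4 counts, 5 edges

# A sequence for `bins` sets the edges explicitly (widths may differ)
counts_custom, edges_custom = np.histogram(values, bins=[0, 1, 3, 4])
print(counts_custom)

# `weights` adds each value's weight to its bin instead of adding 1
weights = np.array([1, 1, 1, 2, 2, 2, 2])
weighted, _ = np.histogram(values, bins=4, range=(0, 4), weights=weights)
print(weighted)

# `density=True` rescales the counts so the bars have a total area of 1
density, _ = np.histogram(values, bins=4, range=(0, 4), density=True)
print(np.sum(density * np.diff(edges)))
```

Note that every bin is half-open on the right except the last one, which also includes its upper edge.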

Using NumPy’s histogram() to create a true histogram by binning the data

What we call a true histogram here is a density-normalized histogram: an estimate of the probability density function (PDF) of the data, which describes how likely values are to fall near each point. To create one, we bin the data into intervals, usually of equal width, and scale the bar heights so that the sum of the areas of all the bars equals 1.

To illustrate, let’s create a histogram for a set of data using NumPy’s histogram() function.

import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data
data = np.random.randn(1000)

# Create a true histogram of the data
hist, bins = np.histogram(data, bins=30, density=True)

# Plot the histogram and overlay the NumPy densities at the bin centers
plt.hist(data, bins=30, density=True, alpha=0.5)
centers = (bins[:-1] + bins[1:]) / 2
plt.plot(centers, hist, 'r')
plt.show()

In the code above, we first generate some sample data using NumPy’s random.randn() function. We then create a histogram of the data using NumPy’s histogram() function with the bins parameter set to 30 and the density parameter set to True.

Finally, we plot the histogram using Matplotlib’s hist() function and overlay the density values computed by np.histogram() as a red line using Matplotlib’s plot() function.
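The normalization can be checked numerically. The sketch below assumes a seeded random generator, used only to make the run repeatable, and recomputes the densities by hand as count / (total count × bin width).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the run is repeatable
data = rng.standard_normal(1000)

counts, edges = np.histogram(data, bins=30)
density, _ = np.histogram(data, bins=30, density=True)

widths = np.diff(edges)  # one width per bin
# density = count / (total count * bin width)
manual = counts / (counts.sum() * widths)

print(np.allclose(manual, density))
print(np.isclose(np.sum(density * widths), 1.0))
```

Both checks should print True: the hand-computed values agree with density=True, and the bar areas sum to 1.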

Utilizing np.bincount() to construct a frequency table from the histogram

Once we have binned our data, we may want a frequency table for it.

A frequency table lists the count for each bin of the histogram. The counts returned by np.histogram() already form such a table, but we can also build one ourselves: np.digitize() assigns each value to a bin index, and np.bincount() counts how many times each index occurs.

To illustrate, let’s create a frequency table for a new sample of random integers.

import numpy as np

# Generate some sample data
data = np.random.randint(0, 10, size=100)

# Create a histogram of the data
hist, bins = np.histogram(data, bins=10)

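# Extract a frequency table: digitize against the interior edges so that
# indices run from 0 to 9 and the maximum value lands in the last bin
freq_table = np.bincount(np.digitize(data, bins[1:-1]), minlength=len(hist))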

# Print the frequency table
print(freq_table)

In the code above, we first generate some sample data using NumPy’s random.randint() function. We then create a histogram of the data using NumPy’s histogram() function with the bins parameter set to 10.

Finally, np.digitize() assigns each value to a bin index, np.bincount() counts how many values received each index, and we print the resulting frequency table.
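When the data are already non-negative integers, np.bincount() can be applied to them directly, with no binning step. The following sketch assumes a seeded generator for repeatability and compares the result with np.histogram() using one unit-wide bin per integer.

```python
import numpy as np

rng = np.random.default_rng(1)        # seeded for repeatability
data = rng.integers(0, 10, size=100)  # integers from 0 to 9

# bincount counts occurrences of each non-negative integer;
# minlength keeps a slot for values that never occur
freq_table = np.bincount(data, minlength=10)

# The same counts from np.histogram, with one unit-wide bin per integer
hist, edges = np.histogram(data, bins=np.arange(11))

for value, count in enumerate(freq_table):
    print(value, count)
print(np.array_equal(freq_table, hist))
```

The position in the bincount() result is the value itself, so the table can be read without any bin edges.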

Conclusion

In summary, NumPy’s histogram function is a powerful tool for working with histograms in Python. By using NumPy’s histogram() function with density=True, we can create density-normalized histograms whose bar areas sum to 1 and that approximate the PDF of the data.

Additionally, we can use NumPy’s bincount() function to extract frequency tables from histograms, allowing us to perform further analysis on our data. By leveraging these capabilities of NumPy, we can gain valuable insights into the distribution of our data and make informed decisions in our data analysis.

Wrapping Up

In this article, we explored the topic of histograms and their significance in data analysis. We discussed how to build and plot histograms using Python libraries like NumPy, Matplotlib, Pandas, and Seaborn.

We also delved into pure Python techniques for histogram creation, such as creating frequency tables using dictionaries and using output formatting to construct an ASCII histogram. Furthermore, we examined NumPy’s histogram function in detail and learned how to create a true histogram by binning the data and how to construct a frequency table by utilizing np.bincount().

Overall, understanding histograms can provide valuable insights into the distribution of data for a range of applications in fields like finance, healthcare, and marketing analytics. The use of histograms can help identify patterns, trends and outliers in our data.

With the help of Python libraries such as NumPy, we can create and manipulate histograms more efficiently. Therefore, data scientists and analysts should familiarize themselves with histograms and their functionality for data analysis.
