Everything You Need to Know About Histograms
This article was published as a part of the Data Science Blogathon.
Histograms are one of the best plots for showing how a dataset is distributed. More precisely, a histogram divides the range of a numerical variable into equal-width intervals (bins) and draws a rectangle over each bin whose height is the number of observations that fall into it.
In this article, we will cover when histograms should be used and then build them on example data to see how they can help surface business-related insights. Data visualization is an important aspect of every data science project, and plots like histograms are among the most convenient tools for it.
When to Use Histograms
Before using any technique, we first need to understand when it (in this case, the histogram) gives the best results. Hence, in this section, we will discuss some key situations where histograms can be extremely useful.
- Data type: the first thing to remember is that histograms are meant for numerical data, and that is where they give the best insights.
- Distribution shape: when we want to see how a dataset is distributed, this plot is a handy first check of whether the data look roughly normally distributed.
- Process output: histograms can also be helpful when we want to understand a client’s process in terms of the output it generates, for example how much that output varies.
- Time series: when the problem statement involves time series data, histograms are a good starting point for the analysis; comparing histograms from different periods shows whether the distribution changes over time.
- Summary statistics: histograms are also very useful for estimating common statistical measures such as the median, standard deviation, minimum, and maximum of the data.
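As a rough illustration of the last point, the sketch below puts summary statistics computed directly with NumPy next to what the bin counts and bin edges of a histogram record. The sample is synthetic (a seeded normal draw chosen purely for illustration, not the article’s data), and all variable names are assumptions.

```python
import numpy as np

# Synthetic stand-in data (assumption: not the article's dataset).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=500)

# Exact summary statistics computed from the raw values.
stats = {
    "min": sample.min(),
    "max": sample.max(),
    "median": np.median(sample),
    "std": sample.std(ddof=1),
}

# What the histogram itself records: counts per bin and the bin edges.
counts, edges = np.histogram(sample, bins=20)

# The tallest bin marks where the data are most concentrated.
peak = counts.argmax()
modal_bin = (edges[peak], edges[peak + 1])

# The first and last edges coincide with the minimum and maximum.
print(stats)
print("modal bin:", modal_bin)
print("range from edges:", edges[0], edges[-1])
```

The histogram gives the range exactly and the centre and spread only approximately, which is why it is best treated as a quick visual estimate rather than a replacement for the computed statistics.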
Loading Necessary Libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd

plt.rcParams["figure.figsize"] = (10, 6)
Let’s discuss in a nutshell the use of each library that we have imported:
- NumPy: Used to implement statistics and mathematical formulae using pre-defined functions.
- Matplotlib: This is the OG library for data visualization. We will mainly use it to plot the charts/graphs.
- Seaborn: Another visualization library built on top of matplotlib.
- Pandas: Used for DataFrame manipulation and data cleaning.
To keep this section easy to understand, the data generated is something I’ve thrown together. It’s not just an analytic function from SciPy (that would make it too easy), but I’ve ensured it’s not pathological.
The imports come first so that any missing dependency shows up right at the beginning. If an import errors, pip install whichever dependency you don’t have. If you have issues (especially on Windows machines with NumPy), try installing with conda instead, for example conda install numpy.
d1 = np.loadtxt("example_1.txt")
d2 = np.loadtxt("example_2.txt")
print(d1.shape, d2.shape)
Inference: As mentioned above, we have loaded the two datasets that I created for demonstration purposes. Printing their shapes shows that both of them have 500 rows.
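The two text files are not reproduced here, so the following is a minimal sketch of how stand-in files with the same shape could be created and read back. The distributions, the seed, and the names fake_d1 and fake_d2 are assumptions made for illustration; they are not the data used in this article.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for example_1.txt and example_2.txt: two seeded
# samples of 500 values each. The distributions are assumptions.
rng = np.random.default_rng(42)
fake_d1 = rng.normal(loc=0.0, scale=1.0, size=500)
fake_d2 = rng.normal(loc=1.5, scale=2.0, size=500)

with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, "example_1.txt")
    path2 = os.path.join(tmp, "example_2.txt")
    # One value per line, which np.loadtxt reads back as a 1-D array.
    np.savetxt(path1, fake_d1)
    np.savetxt(path2, fake_d2)
    d1 = np.loadtxt(path1)
    d2 = np.loadtxt(path2)

print(d1.shape, d2.shape)  # (500,) (500,)
```

With stand-in arrays like these in d1 and d2, the plotting snippets in the rest of the article can be run as they are, although the resulting pictures will differ from the ones discussed.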
Now it’s time to get our hands dirty and see different aspects of histograms. Here we will go through three kinds of plots that show the distributions of both datasets on the same axes.
- Normal plots: For a basic understanding of histograms, i.e. to see how the data points are distributed across the range.
- Density and bins: Taking a step forward, the density parameter connects histograms to probability by normalizing the bar heights, and explicit bins make both datasets share the same bin edges.
- Customizing styles: To give the histograms a good look and feel, we will also customize them using the styling options that hist() provides.
plt.hist(d1, label="D1")
plt.hist(d2, label="D2")
plt.legend()
plt.ylabel("Counts");
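Inference: We have used Matplotlib’s hist() function to plot a histogram for both D1 and D2. In Matplotlib’s default colour cycle, the first dataset (D1) is drawn in blue and the second (D2) in orange; the names come from the label argument, and the legend() function displays them.
Drawbacks: The first major drawback is that the bars are completely opaque, so wherever the two histograms overlap, one hides the other and we cannot see both distributions clearly. The second is the inconsistent bin size: each call splits its own dataset’s range into 10 bins by default, so the bin edges and widths differ between D1 and D2 and the bar heights tell a misleading story when compared.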
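The bin-size drawback can be checked numerically. The sketch below uses np.histogram with 10 bins (the same default bin count that plt.hist uses) on two synthetic samples; the samples and their parameters are assumptions standing in for D1 and D2.

```python
import numpy as np

# Synthetic stand-ins for D1 and D2 (assumption: not the article's data).
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 500)  # narrower sample
b = rng.normal(1.5, 2.0, 500)  # wider sample

# Each call derives its 10 bins from its own data range.
_, edges_a = np.histogram(a, bins=10)
_, edges_b = np.histogram(b, bins=10)

width_a = edges_a[1] - edges_a[0]
width_b = edges_b[1] - edges_b[0]
print("bin width for a:", width_a)
print("bin width for b:", width_b)
```

Because the two samples span different ranges, their bins have different widths, so a bar of a given height does not mean the same thing in both histograms.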
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
counts1, _, _ = plt.hist(d1, bins=bins, label="D1")
plt.hist(d2, bins=bins, label="D2")
plt.legend()
plt.ylabel("Counts");
Inference: In this step, we have resolved the bin-size drawback. Both datasets now share the same bin edges, so the bars can be compared directly and the plot no longer tells a misleading story. The bars are still opaque, which we will deal with in the styling part.
Instead of the default bins, we created a new variable, bins, that holds 50 evenly spaced edges (that is, 49 bins) running from the smallest to the largest value across the D1 and D2 datasets, and passed it to both hist() calls.
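The relationship between edges and bins is easy to verify with NumPy alone. This is a small sketch on synthetic stand-in samples (an assumption, as before), not the article’s datasets.

```python
import numpy as np

# Synthetic stand-ins for D1 and D2 (assumption: not the article's data).
rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(1.5, 2.0, 500)

# Shared edges spanning the combined range of both samples.
lo = min(a.min(), b.min())
hi = max(a.max(), b.max())
edges = np.linspace(lo, hi, 50)

counts_a, _ = np.histogram(a, bins=edges)
counts_b, _ = np.histogram(b, bins=edges)

print(len(edges), len(counts_a))       # 50 49
print(counts_a.sum(), counts_b.sum())  # 500 500
```

Since the edges cover the combined range, every observation of both samples lands in some bin, and bin i means the same interval for both.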
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
counts1, _, _ = plt.hist(d1, bins=bins, label="D1", density=True)
plt.hist(d2, bins=bins, label="D2", density=True)
plt.legend()
plt.ylabel("Probability");
Inference: Histograms are closely related to probability. Compare the Y-axis with the previous plot: there it showed counts, while in this plot each histogram is normalized so that the total area of its bars equals 1. Strictly speaking, the bar heights are probability densities rather than probabilities, so a bar can be taller than 1 when the bins are narrow. We have achieved this by adding only one parameter, i.e. density, and setting it to True.
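What density=True does can be written out by hand: each count is divided by the total number of observations times the bin width. The sketch below checks this with np.histogram on a synthetic sample (an assumption for illustration); plt.hist applies the same normalization for a single dataset.

```python
import numpy as np

# Synthetic stand-in sample (assumption: not the article's data).
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, 500)
edges = np.linspace(data.min(), data.max(), 50)
widths = np.diff(edges)

counts, _ = np.histogram(data, bins=edges)
density, _ = np.histogram(data, bins=edges, density=True)

# Manual normalization: count / (total count * bin width).
manual = counts / (counts.sum() * widths)

# Total area under the bars: sum of height * width.
area = (density * widths).sum()
print(np.allclose(density, manual), round(float(area), 6))
```

The manual values match the density output, and the bar areas add up to 1, which is what makes two datasets of different sizes comparable on the same axes.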
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
# stacked=True is what stacks the bars; a list alone draws them side by side
plt.hist([d1, d2], bins=bins, label="Stacked", density=True, stacked=True, alpha=0.5)
plt.hist(d1, bins=bins, label="D1", density=True, histtype="step", lw=1)
plt.hist(d2, bins=bins, label="D2", density=True, histtype="step", ls=":")
plt.legend()
plt.ylabel("Probability");
Inference: Now comes the styling part, where we improve the look and feel of Matplotlib’s histogram plot by changing and adding a few parameters of the hist() function.
- Stacked chart: There is no separate function for stacked histograms. We pass two or three datasets as a list to a single hist() call; by default the bars are then drawn side by side, and setting stacked=True (or histtype="barstacked") places them on top of each other.
- Alpha: This parameter controls the opacity of the bars, from 0 (fully transparent) to 1 (fully opaque). Lowering it lets overlapping data show through, and we can also use it when we want to highlight one dataset more than another.
- Hist-Type: Setting histtype="step" draws only the outline of the histogram as an unfilled line instead of filled bars, which keeps overlapping distributions readable.
- Line width & line style: Line width is set with lw (short for linewidth), which makes the outline thicker or thinner, whereas line style is set with ls (short for linestyle); one can notice that I have chosen ":" for the D2 dataset, which draws its outline as a dotted line.
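The parameters above can be tried together in one self-contained sketch. It uses synthetic stand-in samples and the non-interactive Agg backend so that it runs without a display; both choices, along with the labels, are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for D1 and D2 (assumption: not the article's data).
rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, 500)
b = rng.normal(1.5, 2.0, 500)
edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 50)

fig, ax = plt.subplots()
# A list draws both datasets in one call; stacked=True stacks the bars.
heights, _, _ = ax.hist([a, b], bins=edges, stacked=True, density=True,
                        alpha=0.5, label=["A", "B"])
# Unfilled outlines with different line widths and line styles.
ax.hist(a, bins=edges, density=True, histtype="step", lw=2, label="A outline")
ax.hist(b, bins=edges, density=True, histtype="step", ls=":", label="B outline")
ax.legend()
ax.set_ylabel("Probability density")

# With stacked=True and density=True, the top of the stack integrates to 1.
total_area = (np.asarray(heights)[-1] * np.diff(edges)).sum()
print(round(float(total_area), 6))
plt.close(fig)
```

Note that in the stacked call the normalization applies to the whole stack, while each step outline is normalized on its own, so the outlines are not expected to trace the individual stacked layers.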
Here comes the last part of the article, a short recap of everything we have discussed so far about histograms: what they are, when and where they can be used, and how to implement them in practice. Instead of a paragraph explanation, let’s look at a point-by-point briefing.
- Starting with the introduction, we learned what histograms are, which answered our “what” question. From there, we moved on to the “when”, where we discussed the situations in which these plots are most useful.
- After getting the theoretical knowledge, we moved forward to implementing histogram plots, where we first drew the basic plot and gradually moved towards parameters like density and bins.
- At last, we learned how to improve the plots’ look and feel by using basic parameters like alpha, line width and line style, stacked charts, etc.
Here’s the repo link to this article. I hope you liked my article on Everything you need to know about Histograms. If you have any opinions or questions, comment below.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.