Determining the width of histogram bars | Vose Software

Determining the width of histogram bars

See also: Histogram plots, Graphical descriptions of model outputs, Sturges' rule

A histogram plot is a natural way of representing a set of samples drawn from a single variable. It shows the range, central tendency and shape of the distribution of the data. However, in order to make a histogram plot, one must decide on the number of bars to use or, equivalently, on the width of each bar.

Sturges' rule

Most statistical software uses Sturges' rule, which says that the data range should be split into k equally spaced classes, with k given by:

$$k = \lceil 1 + \log_2 n \rceil$$

Here n is the number of data values and $\lceil \cdot \rceil$ is the ceiling operator (meaning round the calculated value up to the next integer). See the Sturges' rule topic for a more detailed explanation of this equation.
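
As a minimal sketch, the rule in its usual form, k = ceil(1 + log2 n), can be computed as follows. The function names `sturges_bin_count` and `equal_width_edges` are illustrative choices made here and do not come from any particular package.

```python
import math


def sturges_bin_count(n: int) -> int:
    """Number of classes k = ceil(1 + log2(n)) for n data values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + math.log2(n))


def equal_width_edges(data, k: int):
    """Split the data range into k equally spaced classes; returns k + 1 edges."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]
```

For example, `sturges_bin_count(100)` gives 8 classes and `sturges_bin_count(1000)` gives 11, which shows how slowly the number of bars grows with the sample size under this rule.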

Sturges' rule is the most commonly applied rule in statistical software, even though it performs poorly when the data are skewed or otherwise non-normal. There are two better guidelines:

Scott's rule

Scott (1979) proposed that the bar width w should be determined as follows:

$$w = 3.49\, s\, n^{-1/3}$$

where s is the sample standard deviation of the n data values. The equation is derived by minimizing the integrated mean squared error between the histogram and the underlying probability density. The underlying theory requires knowledge of the distributional form of the data, which we rarely have, so the above equation assumes normality, although this turns out not to be very restrictive in practice.
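
A minimal sketch of Scott's rule, taking its commonly quoted form w = 3.49 s n^(-1/3), is shown below. The function name `scott_bar_width` is a hypothetical name used here for illustration, and the sample standard deviation is assumed to use the n - 1 divisor, which is what `statistics.stdev` computes.

```python
import statistics


def scott_bar_width(data) -> float:
    """Bar width w = 3.49 * s * n**(-1/3), with s the sample standard deviation."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two data values are needed")
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return 3.49 * s * n ** (-1 / 3)
```

Because the width shrinks only with the cube root of n, increasing the sample size eightfold halves the bar width.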

Freedman and Diaconis' rule

Freedman and Diaconis (1981) proposed that the bar width w should be determined as follows:

$$w = 2\, \mathrm{IQR}\, n^{-1/3}$$

where IQR is the sample inter-quartile range of the n data values, i.e. the difference between the 75th and 25th percentiles of the data. The rule was based on the goal of minimizing the sum of squared errors between the histogram bar height and the probability density of the underlying distribution, which gave the $n^{-1/3}$ part of the equation. The use of 2*IQR as a measure of spread was determined from their empirical experiments.
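
The sketch below computes the Freedman and Diaconis width in its usual form, w = 2 IQR n^(-1/3), and the number of bars needed to cover the data range. The function names are hypothetical. Several percentile conventions exist, and the default ("exclusive") method of `statistics.quantiles` is assumed here; other software may give slightly different quartiles.

```python
import math
import statistics


def freedman_diaconis_bar_width(data) -> float:
    """Bar width w = 2 * IQR * n**(-1/3)."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two data values are needed")
    # Quartiles with the default 'exclusive' method (an assumption of this sketch).
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) * n ** (-1 / 3)


def bar_count(data, width: float) -> int:
    """Number of bars of the given width needed to cover the data range."""
    if width <= 0:
        raise ValueError("width must be positive")
    return math.ceil((max(data) - min(data)) / width)
```

Because the IQR ignores the outer half of the data, this width is less affected by extreme values than one based on the standard deviation.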

A recommendation

In general, we have found that Scott's rule gives the most pleasing balance between detail and overview (Freedman and Diaconis' rule generally produces more bars). However, the resulting bar ranges will usually have awkward minimum and maximum values, which makes the histogram less easy to read.
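
A rough check of why Freedman and Diaconis' rule tends to give more bars: assuming, purely for illustration, that the data are normally distributed, the inter-quartile range is about 1.349 standard deviations, so the two widths (in their usual forms, with Scott's constant taken as 3.49) compare as follows.

```latex
w_{\mathrm{FD}} = 2\,\mathrm{IQR}\,n^{-1/3}
  \approx 2\,(1.349\,s)\,n^{-1/3}
  = 2.698\,s\,n^{-1/3},
\qquad
\frac{w_{\mathrm{Scott}}}{w_{\mathrm{FD}}} \approx \frac{3.49}{2.698} \approx 1.29
```

Under that assumption, Scott's bars are roughly 29% wider, so Freedman and Diaconis' rule needs about 29% more bars to cover the same range. For non-normal data the ratio will differ.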

In the histograms produced by Vose Data Analysis in ModelRisk, we round the bar ranges off to more intuitive values. The user can also override these settings, which is most useful when there are extreme tails in the data.
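
The rounding method used by ModelRisk is not described here. Purely as an illustration of the idea, the sketch below uses one common convention: round the calculated width to the nearest of 1, 2, 5 or 10 times a power of ten, and start the first bar at a multiple of that width. The function names and the choice of candidate values are assumptions of this sketch, not the ModelRisk algorithm.

```python
import math


def nice_width(w: float) -> float:
    """Round w to the nearest of 1, 2, 5 or 10 times a power of ten (hypothetical rule)."""
    if w <= 0:
        raise ValueError("w must be positive")
    base = 10 ** math.floor(math.log10(w))  # power of ten at or below w
    fraction = w / base                     # lies in [1, 10)
    best = min((1, 2, 5, 10), key=lambda c: abs(c - fraction))
    return best * base


def nice_edges(lo: float, hi: float, w: float):
    """Bar edges at multiples of the rounded width that cover [lo, hi]."""
    step = nice_width(w)
    start = math.floor(lo / step) * step    # first edge at or below lo
    k = math.ceil((hi - start) / step)      # bars needed to reach hi
    return [start + i * step for i in range(k + 1)]
```

For data ranging from 1 to 8 with a calculated width of 4.27, this gives a width of 5 and edges at 0, 5 and 10, which are easier to read than edges at 1, 5.27 and 9.54.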

 


 

 
