Fitting a continuous non-parametric first-order distribution to data | Vose Software

See also: Fitting distributions to data, Fitting in ModelRisk, Analyzing and using data introduction

If the observed data are continuous and reasonably extensive, it is often sufficient to use a cumulative frequency plot of the data points themselves (sometimes known as an ogive) to define the variable's probability distribution. The figure below illustrates an example with 18 data points.

The observed F(x) values are set to the expected value of F(x) at each ranked data point under random sampling from the distribution, i.e. F(xi) = i / (n + 1), where i is the rank of the observed data point and n is the number of data points. An explanation for this formula is provided here. Determination of the empirical cumulative distribution proceeds as follows:

  • The minimum and maximum for the empirical distribution are subjectively determined based on the analyst's knowledge of the variable. For a continuous variable, these values will generally lie outside the observed range of the data, since the smallest and largest observations are unlikely to coincide with the true limits. The minimum and maximum values selected here are zero and 45.

  • The data points are ranked in ascending order between the minimum and maximum values.

  • The cumulative probability F(xi) for each xi-value is calculated as follows:

      F(xi) = i / (n + 1)

      This formula maximises the chance of replicating the true distribution. Use the VoseOgive1 function to generate an array of these F(xi) values directly.

  • The two arrays, {xi} and {F(xi)}, along with the minimum and maximum values, can then be used as direct inputs into a Cumulative distribution using VoseCumulA(min, max, {xi}, {F(xi)}). Alternatively, you could use the VoseOgive({x}, min, max) function to construct the distribution directly.
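
The steps above can be sketched outside a spreadsheet as follows. This is a minimal illustration in Python, not the ModelRisk implementation: the function names ogive_points and ogive_quantile are hypothetical, the sample data are invented for the example, and linear interpolation between points is an assumption about how the cumulative distribution is joined up.

```python
import random
from bisect import bisect_right


def ogive_points(data, minimum, maximum):
    """Return (xs, fs) defining a piecewise-linear empirical CDF.

    The i-th smallest of n observations gets F(x_i) = i / (n + 1).
    The subjectively chosen minimum and maximum are pinned to F = 0 and F = 1.
    """
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one data point")
    if not (minimum < xs[0] and xs[-1] < maximum):
        raise ValueError("minimum and maximum must lie outside the observed range")
    fs = [i / (n + 1) for i in range(1, n + 1)]
    return [minimum] + xs + [maximum], [0.0] + fs + [1.0]


def ogive_quantile(xs, fs, u):
    """Invert the piecewise-linear CDF at cumulative probability u in [0, 1]."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must be between 0 and 1")
    if u >= 1.0:
        return xs[-1]
    # k is the first index whose cumulative probability exceeds u,
    # so u falls on the segment between points k - 1 and k.
    k = bisect_right(fs, u)
    x0, x1 = xs[k - 1], xs[k]
    f0, f1 = fs[k - 1], fs[k]
    return x0 + (x1 - x0) * (u - f0) / (f1 - f0)


if __name__ == "__main__":
    # Invented observations; 0 and 45 are the bounds used in the text above.
    observations = [12.1, 7.4, 30.2, 18.9, 22.5, 15.3, 9.8, 26.7]
    xs, fs = ogive_points(observations, 0.0, 45.0)
    rng = random.Random(0)
    samples = [ogive_quantile(xs, fs, rng.random()) for _ in range(5)]
    print(samples)
```

Sampling here uses the inverse-transform method: a uniform random number is treated as a cumulative probability and mapped back to an x-value along the ogive.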

If there is a very large amount of data, it becomes impractical to use every data point to define the Cumulative distribution. In such cases, it is useful to batch the data first, as in a histogram. The number of batches should balance fineness of detail (a large number of bars) against the practicality of the arrays that define the distribution (a smaller number of bars).
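
One way to batch the data is sketched below in Python. The text does not specify a batching scheme, so the choices here are assumptions: equal-width batches between the chosen minimum and maximum, and a cumulative probability at each batch edge equal to the fraction of observations at or below it. The function name batched_cumulative is hypothetical.

```python
def batched_cumulative(data, minimum, maximum, n_batches):
    """Summarise data as batch edges and cumulative probabilities at those edges.

    Assumes equal-width batches spanning [minimum, maximum]. Returns (edges, cum),
    each of length n_batches + 1, with cum[0] = 0 and cum[-1] = 1.
    """
    if n_batches < 1:
        raise ValueError("need at least one batch")
    if not data:
        raise ValueError("need at least one data point")
    width = (maximum - minimum) / n_batches
    counts = [0] * n_batches
    for x in data:
        if not minimum <= x <= maximum:
            raise ValueError("data point outside [minimum, maximum]")
        # A value equal to the maximum is placed in the last batch.
        k = min(int((x - minimum) / width), n_batches - 1)
        counts[k] += 1
    edges = [minimum + k * width for k in range(n_batches + 1)]
    edges[-1] = maximum  # avoid floating-point drift on the final edge
    total = len(data)
    cum = [0.0]
    running = 0
    for c in counts:
        running += c
        cum.append(running / total)
    return edges, cum
```

The interior edges and their cumulative values then play the role of the {xi} and {F(xi)} arrays, with far fewer entries than the raw data. An empty batch produces two equal consecutive cumulative values, which may need merging if the target function requires strictly increasing probabilities.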

 
