Fitting a continuous non-parametric second-order distribution to data | Vose Software

# Fitting a continuous non-parametric second-order distribution to data

When we do not have a great deal of data, a considerable amount of uncertainty will remain about an empirical distribution determined directly from the data. It would be very useful to have the flexibility of an empirical distribution, i.e. not having to assume a parametric distribution, while also being able to quantify the uncertainty about that distribution. The following Bayesian technique meets both requirements.

Consider a set of n data values {x_j} drawn from a distribution and ranked in ascending order {x_i}, so that x_i < x_{i+1}. Data ranked in this way are known as the order statistics of {x}. We can use these order statistics to construct a second-order empirical cumulative distribution.


Recapping from the proof, the formulae from Equations 3 and 4 are:

(3)

(4)

where P_j is the estimate of the cumulative distribution function at x_j, i.e. P_j = F(x_j). These estimates can be used as inputs to ModelRisk's VoseCumulA distribution, together with subjective estimates of the minimum and maximum values that the variable may take. The minimum and maximum can themselves be assigned subjective distributions:

=VoseCumulA(min, max, {xj}, {Pj}).

Alternatively, ModelRisk's VoseOgiveU({x}, min, max) function will construct this distribution automatically.
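As a rough illustration of what the {xj} and {Pj} inputs look like outside a spreadsheet, the Python sketch below draws one joint sample of the cumulative probabilities and assembles the two arrays that a Cumulative distribution needs. It relies on the standard recursion for the order statistics of a uniform sample, P_1 = 1 - U^(1/n) and P_{i+1} = 1 - (1 - P_i) U^(1/(n-i)), which is assumed here and may be written differently from Equations 3 and 4. The data values, bounds and function names are hypothetical, and the code does not reproduce ModelRisk's implementation.

```python
import random


def sample_cumulative_probs(n, rng):
    """Draw one joint sample of P_1..P_n, where P_i = F(x_i) for ranked data.

    Assumption: the standard uniform order-statistic recursion
        P_1     = 1 - U^(1/n)
        P_{i+1} = 1 - (1 - P_i) * U^(1/(n - i))
    with a fresh Uniform(0, 1) value U at each step.
    """
    probs = []
    survival = 1.0  # running value of 1 - P_i, starting from P_0 = 0
    for i in range(n):
        u = rng.random()
        # n - i data points lie at or above the point being processed
        survival *= u ** (1.0 / (n - i))
        probs.append(1.0 - survival)
    return probs


# Hypothetical data and subjective bounds, for illustration only.
data = [2.1, 3.4, 3.9, 5.0, 7.2]
minimum, maximum = 1.0, 9.0

rng = random.Random(42)
# Arrays defining a piecewise-linear cumulative curve from (min, 0) to (max, 1).
xs = [minimum] + sorted(data) + [maximum]
ps = [0.0] + sample_cumulative_probs(len(data), rng) + [1.0]
```

Each call to `sample_cumulative_probs` gives a different plausible cumulative curve through the same x values; the spread between those curves is the uncertainty about the distribution.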

##### Using this technique in a second-order model

• The uncertainty distributions for F(x) are nominated as outputs;

• A relatively small number of iterations is run;

• The resultant data are exported back to a spreadsheet;

• Those data are then used to perform multiple simulations (the 'outer loop') of uncertainty; the 'inner loop' of variability comes from the Cumulative distribution itself.
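The steps above can be sketched in Python as a nested simulation. This is a minimal sketch under stated assumptions, not ModelRisk's procedure: the list `exported` stands in for the table of F(x) samples exported to the spreadsheet, the cumulative probabilities are drawn with the standard uniform order-statistic recursion (assumed, and possibly written differently from Equations 3 and 4), and the Cumulative distribution is taken to be piecewise linear between the points. All names, data values and loop sizes are hypothetical.

```python
import bisect
import random


def sample_cumulative_probs(n, rng):
    """One joint draw of P_1..P_n (assumed uniform order-statistic recursion)."""
    probs, survival = [], 1.0
    for i in range(n):
        survival *= rng.random() ** (1.0 / (n - i))
        probs.append(1.0 - survival)
    return probs


def cumulative_quantile(u, xs, ps):
    """Invert the piecewise-linear cumulative curve through (xs[k], ps[k])."""
    k = bisect.bisect_right(ps, u) - 1
    k = min(max(k, 0), len(ps) - 2)  # keep k on a valid segment
    span = ps[k + 1] - ps[k]
    if span == 0.0:
        return xs[k]
    return xs[k] + (u - ps[k]) / span * (xs[k + 1] - xs[k])


def second_order_means(data, minimum, maximum, n_outer, n_inner, seed=0):
    """Return one simulated mean per outer-loop (uncertainty) sample."""
    rng = random.Random(seed)
    xs = [minimum] + sorted(data) + [maximum]
    # A small table of sampled F(x) vectors, playing the role of the exported data.
    exported = [sample_cumulative_probs(len(data), rng) for _ in range(n_outer)]
    means = []
    for probs in exported:  # outer loop: uncertainty about F(x)
        ps = [0.0] + probs + [1.0]
        # inner loop: variability, sampled from this one cumulative curve
        draws = [cumulative_quantile(rng.random(), xs, ps) for _ in range(n_inner)]
        means.append(sum(draws) / n_inner)
    return means


# Hypothetical data and subjective bounds, for illustration only.
data = [2.1, 3.4, 3.9, 5.0, 7.2]
means = second_order_means(data, 1.0, 9.0, n_outer=50, n_inner=2000)
```

The spread of the values in `means` reflects uncertainty about the distribution's mean, while the spread of `draws` within any single outer iteration reflects variability.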

##### Limitations

There are a few limitations to this technique. In using a Cumulative distribution, one is assuming a histogram-style probability distribution function between each pair of adjacent {x} values. When there are a large number of data points, this approximation becomes irrelevant. However, for small data sets the approximation will tend to accentuate the tails of the distribution: a result of the histogram 'squaring-off' effect of using the Cumulative distribution. In other words, the variability will be slightly exaggerated. The squaring-off effect can be reduced, if required, by using some sort of smoothing algorithm and defining points between each observed value.

In addition, for small data sets, the tails' contribution to the variability will often be more influenced by the subjective estimates of the minimum and maximum values: a fact one can view either positively (one is recognizing the real uncertainty about a distribution's tail) or negatively (the smaller the data set, the more the technique relies on subjective assessment).

Quite naturally, the fewer the data points, the wider the confidence intervals become and, in general, the more emphasis is placed on the subjectively defined minimum and maximum values. Conversely, the more data points available, the less influence the minimum and maximum estimates have on the estimated distribution. In any case, the values of the minimum and maximum only influence the width (and therefore height) of the two end histogram bars in the fitted distribution. Because the technique is non-parametric, i.e. no statistical distribution with a particular cumulative distribution function is assumed to underlie the data, it allows the analyst a far greater degree of flexibility and objectivity than fitting parametric distributions affords.
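The point about the end bars can be checked with a few lines of Python. The sketch below computes the height of each histogram bar implied by a piecewise-linear cumulative curve, once with tight and once with wide subjective bounds. For simplicity it fixes the cumulative probabilities at i/(n+1), the mean of each P_i under the usual order-statistic result, rather than sampling them; the data, bounds and function name are hypothetical.

```python
def bar_heights(data, minimum, maximum, probs):
    """Density of each histogram bar: probability mass divided by bar width."""
    xs = [minimum] + sorted(data) + [maximum]
    ps = [0.0] + list(probs) + [1.0]
    return [(ps[k + 1] - ps[k]) / (xs[k + 1] - xs[k]) for k in range(len(xs) - 1)]


# Hypothetical data, for illustration only.
data = [2.1, 3.4, 3.9, 5.0, 7.2]
n = len(data)
probs = [i / (n + 1) for i in range(1, n + 1)]  # illustrative fixed P_i values

tight = bar_heights(data, 1.0, 9.0, probs)   # narrow subjective bounds
wide = bar_heights(data, 0.0, 15.0, probs)   # wide subjective bounds

# Interior bars are identical; only the first and last bars change.
interior_unchanged = tight[1:-1] == wide[1:-1]
```

Widening the bounds stretches the two end bars and lowers their heights, while every bar between observed data values is left as it was.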