Matching the properties of the variable and distribution

Before attempting to fit a probability distribution to a set of observed data, you should first consider the properties of the variable in question. The properties of the distribution or distributions chosen to be fitted to the data should match those of the variable being modelled. Distribution fitting software has made fitting distributions to data very easy and has removed the perceived need for any in-depth statistical knowledge. These products can be useful but, through their automation and ease of use, inadvertently encourage the user to attempt fits to wholly inappropriate distributions. It is therefore worth considering the following points before attempting a fit:

Is the variable to be modelled discrete or continuous?

A discrete variable may only take certain specific values, for example the number of bridges along a motorway, whereas a measurement such as the volume of tarmac is continuous. A variable that is discrete in nature is usually, but not always, best fitted to a discrete distribution. A very common exception is where the increment between contiguous allowable values is insignificant compared with the range that the variable may take. For example, consider a distribution of the number of people using the London Underground on any particular day. Although there can only be a whole number of people using the Tube, it is easier to model this number as a continuous variable: the users number in the millions, so there is little importance and considerable practical difficulty in recognizing the discreteness of the number.

In certain circumstances, discrete distributions can be very closely approximated by continuous distributions for large values of x. If a discrete variable has been modelled by a continuous distribution for convenience, its discrete nature can easily be put back into the risk analysis model by using the ROUND(...,0) function in Excel.
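The rounding step can be sketched outside Excel as follows. This is a minimal Python illustration, not part of ModelRisk; the function name daily_passengers and the mean and standard deviation are hypothetical values chosen only to mimic a count in the millions.

```python
import random

def daily_passengers(rng, mean=3_000_000.0, sd=150_000.0):
    """Sample a continuous approximation, then round to a whole number of people.

    mean and sd are hypothetical values for illustration only.
    """
    x = rng.gauss(mean, sd)   # continuous approximation of a discrete count
    return max(0, round(x))   # restore discreteness; a count cannot be negative

rng = random.Random(42)       # fixed seed so the sketch is repeatable
samples = [daily_passengers(rng) for _ in range(5)]
```

Each sampled value is an integer, which plays the same role as wrapping the continuous distribution in ROUND(...,0).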

The reverse of the above, however, never occurs, i.e. data from a continuous variable is always fitted to a continuous distribution and never a discrete distribution.

Does the theoretical range of the variable match that of the fitted distribution?

The fitted distribution should, within reason, cover the range over which the variable being modelled may theoretically extend. If the fitted distribution extends beyond the variable's possible range, a risk analysis model will produce impossible scenarios, although ModelRisk's VoseXBounds and VosePBounds functions can be used to reduce the distribution's range. If the distribution fails to extend over the entire possible range of the variable, the risk analysis will not reflect the true uncertainty of the problem. For example, data on the oil saturation of a hydrocarbon reserve should be fitted to a distribution that is bounded at zero and 1 since values outside that range are nonsensical. It may turn out that a Normal(0.2,0.03) distribution, for example, fits the data far better than any other shape but, of course, it extends from -∞ to +∞. In order to ensure that the risk analysis only produces meaningful scenarios, the Normal distribution would be truncated in the risk analysis model at zero and 1 by using VoseNormal(0.2,0.03,,VoseXBounds(0,1)).
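The effect of truncating the Normal(0.2,0.03) distribution at zero and 1 can be imitated with simple rejection sampling, as in the Python sketch below. This is only an assumption about one way to obtain a truncated sample; it is not a description of how VoseXBounds is implemented, and the function name truncated_normal is hypothetical.

```python
import random

def truncated_normal(rng, mu, sigma, lower, upper, max_tries=10_000):
    """Draw from Normal(mu, sigma) restricted to [lower, upper] by rejection."""
    for _ in range(max_tries):
        x = rng.gauss(mu, sigma)
        if lower <= x <= upper:   # keep only physically meaningful values
            return x
    # Rejection sampling is inefficient when the bounds lie far in a tail.
    raise RuntimeError("no sample fell inside the bounds")

rng = random.Random(1)            # fixed seed so the sketch is repeatable
saturation = [truncated_normal(rng, 0.2, 0.03, 0.0, 1.0) for _ in range(10_000)]
```

For this example almost no samples are rejected, because zero lies more than six standard deviations below the mean, but the bounds guarantee that no impossible scenario can reach the rest of the model.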

Note that a correctly fitted distribution will usually cover a range that is greater than that displayed by the observed data. This is quite acceptable because data are rarely observed at the theoretical extremes for the variable in question.

Is this variable independent of other variables in the model?

The variable may be correlated with, or a function of, another variable within the model. It may also be related to a variable outside the model which, in turn, affects a third variable within the risk analysis model. The figures illustrate a couple of examples:

In this example a high street bank's revenue is modelled as a function of the interest and mortgage rates, among other things. The mortgage rate is correlated to the interest rate since the interest rate largely defines what the mortgage rate is to be. This relationship must be included in the model to ensure that the simulation will only produce meaningful scenarios. There are two approaches to modelling such dependency relationships:

1) Determine distributions for the mortgage and interest rates based on historical data and then correlate the sampling from these distributions during simulation.

2) Determine the distribution of the interest rate from historical data, and then derive the mortgage rate from it through a (stochastic) functional relationship.

Method 1 is tempting because of its simple execution, but method 2 offers greater opportunity to reproduce any observed relationship between the two variables.
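The two methods can be contrasted in a short Python sketch. Everything numerical here is hypothetical: the Normal distributions for the rates, the correlation coefficient of 0.9, and the linear relationship with a 1.5 point margin are illustrative assumptions, not values from any data set. Method 1 is imitated with a bivariate Normal, which is only one of several ways to correlate sampling.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def method_1(rng, n, rho=0.9):
    """Separate distributions for each rate, sampled with correlation rho."""
    interest, mortgage = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        interest.append(4.0 + 1.0 * z1)   # hypothetical Normal(4, 1), in percent
        mortgage.append(5.5 + 1.2 * z2)   # hypothetical Normal(5.5, 1.2), in percent
    return interest, mortgage

def method_2(rng, n):
    """Interest rate from a distribution; mortgage rate as a noisy function of it."""
    interest = [rng.gauss(4.0, 1.0) for _ in range(n)]
    # Hypothetical relationship: interest rate plus a 1.5 point margin plus noise.
    mortgage = [1.5 + i + rng.gauss(0.0, 0.3) for i in interest]
    return interest, mortgage

rng = random.Random(2024)                 # fixed seed so the sketch is repeatable
i1, m1 = method_1(rng, 20_000)
i2, m2 = method_2(rng, 20_000)
r1, r2 = pearson(i1, m1), pearson(i2, m2)
margin = sum(m - i for i, m in zip(i2, m2)) / len(i2)
```

Method 1 only fixes the strength of the association, whereas method 2 lets the modeller specify the shape of the relationship directly, here a fixed margin above the interest rate, which is why it can reproduce an observed relationship more faithfully.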

In this second example, a construction sub-contractor is calculating her bid price to supply labour for a roofing job. The choice of roofing material has not yet been decided, and this uncertainty has implications for the person-hours that will be needed both to construct the roofing timbers and to lay the roof. There is therefore an indirect dependency between these two variables that could easily have been missed had she not looked outside the immediate components of her cost calculation. Missing this correlation would have resulted in an underestimation of the spread of the sub-contractor's cost and could potentially have led her to quote a price that exposed her to significant loss. Correlation and dependency relationships form a vital part of many risk analyses. Several techniques to model correlation and dependencies between variables are discussed here.
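The underestimation of spread can be demonstrated with a small simulation. The Python sketch below uses entirely hypothetical person-hour figures and two made-up material options; it compares a model in which both tasks respond to the same material choice with one that wrongly treats them as independent.

```python
import random
import statistics

# Hypothetical person-hours (mean, sd) for each task under each roofing material.
TIMBER_HOURS = {"tile": (100.0, 10.0), "slate": (140.0, 10.0)}
LAYING_HOURS = {"tile": (200.0, 15.0), "slate": (260.0, 15.0)}
MATERIALS = ("tile", "slate")

def total_hours(rng, shared_material):
    """Total person-hours; shared_material=True honours the indirect dependency."""
    timber_material = rng.choice(MATERIALS)
    # Ignoring the dependency amounts to drawing a second, unrelated material.
    laying_material = timber_material if shared_material else rng.choice(MATERIALS)
    timber = rng.gauss(*TIMBER_HOURS[timber_material])
    laying = rng.gauss(*LAYING_HOURS[laying_material])
    return timber + laying

rng = random.Random(7)                    # fixed seed so the sketch is repeatable
n = 20_000
with_dependency = [total_hours(rng, True) for _ in range(n)]
without_dependency = [total_hours(rng, False) for _ in range(n)]
spread_with = statistics.pstdev(with_dependency)
spread_without = statistics.pstdev(without_dependency)
```

Under these assumed figures both models give the same mean of 350 person-hours, but the standard deviation is roughly 53 person-hours when the dependency is included and roughly 40 when it is ignored, so the independent model understates the risk in the bid.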
