Selecting the appropriate distributions for your model | Vose Software

Selecting the appropriate distributions for your model

See also: Distributions introduction, Distributions in ModelRisk

The precision of a risk analysis relies very heavily on the appropriate use of probability distributions to accurately represent the uncertainty, randomness and variability of the problem. In our experience, inappropriate use of probability distributions has proven to be a very common failure of risk analysis models. It stems, in part, from an inadequate understanding of the theory behind probability distribution functions and, in part, from failing to appreciate the knock-on effects of using inappropriate distributions.

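In this section we discuss five basic properties of distributions and how these properties should be used to select the distributions in your model. The five properties are:

  1. Discrete or continuous;

  2. Bounded or unbounded;

  3. Parametric or non-parametric;

  4. Univariate or multivariate; and

  5. First or second order.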

Finally, we have put together tables, with links for each distribution, showing where each univariate and multivariate distribution fits into these categories.

Discrete and continuous distributions

The most basic distinguishing property between probability distributions is whether they are continuous or discrete. It is extraordinary how often the discrete or continuous nature of a variable is overlooked when selecting the distribution that will be used to model it.

Discrete distributions

A discrete distribution may take one of a set of identifiable values, each of which has a calculable probability of occurrence. Discrete distributions are used to model parameters like the number of bridges a roading scheme may need, the number of key personnel to be employed or the number of customers that will arrive at a service station in an hour. Clearly, variables such as these can only take specific values: one cannot build half a bridge, employ 2.7 people or serve 13.6 customers.

Continuous distributions

A continuous distribution is used to represent a continuous variable, i.e. a variable that can take any value within a defined range (domain). For example, the height of an adult English male picked at random will have a continuous distribution because the height of a person is essentially infinitely divisible. We could measure his height to the nearest centimeter, millimeter, tenth of a millimeter, etc. The scale can be repeatedly divided up generating more and more possible values.

Properties like time, mass and distance, that are infinitely divisible, are modelled using continuous distributions. In practice, we also use continuous distributions to model variables that are, in truth, discrete but where the gap between allowable values is insignificant: for example, project cost (which is discrete with steps of one penny, one cent, etc.), exchange rate (which is only quoted to a few significant figures), number of employees in a large organization, etc.
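The difference between the two types can be seen by sampling from each. The following is a minimal Python sketch, not ModelRisk code: the Poisson arrival rate of 12 customers per hour, the Normal(175, 7) height in centimeters and the helper name poisson_sample are all assumptions made for illustration.

```python
import math
import random


def poisson_sample(rate, rng):
    """Draw one Poisson count using Knuth's multiplication method.

    Suitable for small rates only; the result is always a whole number.
    """
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(1)

# Discrete variable: customers arriving in an hour (hypothetical mean of 12).
customers = [poisson_sample(12.0, rng) for _ in range(5)]

# Continuous variable: adult height in cm (hypothetical mean 175, sd 7).
heights = [rng.gauss(175.0, 7.0) for _ in range(5)]

print("customers per hour:", customers)
print("heights in cm:", [round(h, 2) for h in heights])
```

The counts can only ever be integers, whereas the heights can fall anywhere on the scale and are rounded here purely for display.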

Bounded and unbounded distributions

A distribution that is confined to lie between two determined values is said to be bounded. A distribution that is unbounded theoretically extends from minus infinity to plus infinity. A distribution that is constrained at only one end is said to be partially bounded.

Unbounded and partially bounded distributions may, at times, need to be truncated to remove part of a tail so that nonsensical values are avoided. For example, using a Normal distribution to model sales volume opens up the chance of generating a negative value. If the probability of generating a negative value is significant, and we want to stick to using a Normal distribution, we must constrain the model so that no negative sales volume can be generated. ModelRisk provides the function VoseXBounds( ) for this purpose. For example, =VoseNormal(10, 3, VoseXBounds(5, 17)) will truncate a Normal(10, 3) distribution to lie between 5 and 17, as shown below. =VoseNormal(10, 3, VoseXBounds(, 15)) would just cut off the right tail at 15.
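For readers who want to see what truncation does outside a spreadsheet, here is a minimal Python sketch of one common way to sample a truncated Normal, the inverse-CDF method. It is an illustration only and makes no claim about how VoseXBounds is implemented; the function name truncated_normal is hypothetical.

```python
import random
from statistics import NormalDist


def truncated_normal(mu, sigma, lower=None, upper=None, rng=random):
    """Sample Normal(mu, sigma) restricted to [lower, upper].

    A uniform value is drawn between the cumulative probabilities of the
    two bounds and mapped back through the inverse CDF. A bound given as
    None leaves that tail untouched.
    """
    dist = NormalDist(mu, sigma)
    p_lo = dist.cdf(lower) if lower is not None else 0.0
    p_hi = dist.cdf(upper) if upper is not None else 1.0
    while True:
        u = rng.uniform(p_lo, p_hi)
        if 0.0 < u < 1.0:  # inv_cdf is undefined at exactly 0 and 1
            break
    x = dist.inv_cdf(u)
    # Guard against tiny floating-point overshoot at the bounds.
    if lower is not None:
        x = max(x, lower)
    if upper is not None:
        x = min(x, upper)
    return x


rng = random.Random(42)
both_tails = [truncated_normal(10, 3, 5, 17, rng) for _ in range(10000)]
right_tail_only = [truncated_normal(10, 3, upper=15, rng=rng) for _ in range(10000)]

print(min(both_tails) >= 5, max(both_tails) <= 17)
print(max(right_tail_only) <= 15)
```

Unlike simply rejecting or clipping out-of-range values, this approach keeps the shape of the Normal density inside the bounds and rescales it so that the remaining probability sums to one.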

Right-bounded distributions

You will notice from the table below that none of the distributions is bounded only on the right. However, if you require a right-bounded distribution for some reason, you need only reflect a left-bounded distribution by negating it and, where necessary, adding a constant. For example: =-VoseWeibull(2,5) produces a left-skewed distribution with an unbounded minimum and a maximum of 0; =10-VoseGamma(2,1.5) produces a left-skewed distribution with an unbounded minimum and a maximum of 10, as shown in the figures below.
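The same reflection trick can be sketched in Python with the standard library. This is an illustration under stated assumptions: it assumes the first argument of the Weibull and Gamma formulas above is the shape parameter and the second the scale, and the helper names are hypothetical.

```python
import random
from statistics import mean


def right_bounded_weibull(shape, scale, rng):
    """Negated Weibull: maximum of 0 and no lower bound.

    Note that random.weibullvariate takes (scale, shape), in that order.
    """
    return -rng.weibullvariate(scale, shape)


def right_bounded_gamma(maximum, shape, scale, rng):
    """Reflected and shifted Gamma: upper bound at `maximum`."""
    return maximum - rng.gammavariate(shape, scale)


rng = random.Random(0)
weibull_values = [right_bounded_weibull(2, 5, rng) for _ in range(20000)]
gamma_values = [right_bounded_gamma(10, 2, 1.5, rng) for _ in range(20000)]

print("largest reflected Weibull value:", max(weibull_values))
print("largest reflected Gamma value:", max(gamma_values))
print("mean of reflected Gamma:", round(mean(gamma_values), 2))
```

Because the Gamma mean is shape times scale, the reflected Gamma values should average close to 10 minus that product.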

Parametric and non-parametric distributions

There is a very useful distinction to be made between model-based parametric and empirical non-parametric distributions. By 'model-based', we mean a distribution whose shape is born of the mathematics describing a conceptual probability model. By 'empirical' or 'non-parametric' we mean a distribution whose mathematics is defined by the shape that is required. For example, a Triangle distribution is defined by its minimum, mode and maximum values. The defining parameters are features of the graph shape.

Those distributions that fall under the 'empirical' or non-parametric class are intuitively easy to understand, extremely flexible and are therefore very useful. Model-based or parametric distributions require a greater knowledge of the underlying assumptions if they are to be used properly.
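To make the Triangle example concrete, the sketch below writes out its density directly from the three shape parameters. It is a minimal Python illustration: the function name triangle_pdf and the values 2, 4 and 10 are assumptions chosen for the example.

```python
import random


def triangle_pdf(x, minimum, mode, maximum):
    """Density of a Triangle distribution defined by its graph shape.

    The peak height follows from requiring the triangle's area to be 1:
    0.5 * (maximum - minimum) * peak = 1.
    """
    if x < minimum or x > maximum:
        return 0.0
    peak = 2.0 / (maximum - minimum)
    if x <= mode:
        if mode == minimum:
            return peak
        return peak * (x - minimum) / (mode - minimum)
    return peak * (maximum - x) / (maximum - mode)


# Hypothetical shape: minimum 2, most likely value 4, maximum 10.
print("density at the mode:", triangle_pdf(4, 2, 4, 10))

# The standard library samples the same shape; note the argument
# order is (low, high, mode).
rng = random.Random(5)
samples = [rng.triangular(2, 10, 4) for _ in range(1000)]
print("all samples within bounds:", all(2 <= s <= 10 for s in samples))
```

Nothing in this definition refers to an underlying probability model: the three numbers simply describe where the graph starts, peaks and ends.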

Parametric distributions should only be selected if at least one of the following applies:

  1. The theory underpinning the distribution applies to the particular problem;

  2. It is generally accepted that the distribution has proven to be very accurate for modeling the specific variable, even though there is no theory to support the observation;

  3. The distribution matches the observed data very well indeed; or

  4. One wishes to use a distribution that has a long tail extending beyond the observed minimum or maximum.

These issues are discussed in more detail in the optional module on fitting distributions to data.

Univariate and multivariate distributions

Univariate distributions describe a single parameter or variable and are used to model a parameter or variable that is not probabilistically linked to any other in the model. Multivariate distributions describe several parameters whose values are probabilistically linked in some way. In most cases, we create the probabilistic links via one of several correlation methods. However, there are a few specific multivariate distributions that have specific, very useful purposes and are therefore worth studying more.
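A small example may help show what a probabilistic link looks like in practice. The Python sketch below generates pairs of standard Normal values with a chosen correlation, which is the two-variable case of a Multivariate Normal. It is an illustration only and is not one of the correlation methods ModelRisk provides; the function names and the correlation of 0.8 are assumptions.

```python
import math
import random


def correlated_normal_pair(rho, rng):
    """Return two standard Normal values with correlation rho.

    The second value mixes the first with independent noise so that
    its variance stays at 1 while the correlation equals rho.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def sample_correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
pairs = [correlated_normal_pair(0.8, rng) for _ in range(20000)]
print("sample correlation:", round(sample_correlation(pairs), 3))
```

Sampling the two variables independently from their univariate distributions would lose this link, which is why linked variables need either a correlation method or a multivariate distribution.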

First- and second-order distributions

A probability or inter-individual variability distribution for which the parameters are precisely known is called a first-order distribution. A probability or inter-individual variability distribution for which there is some uncertainty about the parameters is called a second-order distribution. Thus, for example, Normal(100,10) is a first-order distribution, whereas Normal(m,s) is a second-order distribution if m and s are estimated and thus themselves carry uncertainty distributions.

You cannot have a second-order distribution of uncertainty because you cannot have uncertainty about uncertainty: it collapses down to a single distribution of uncertainty, just as a probability distribution of a probability distribution collapses down to a single probability distribution.
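A second-order Normal(m,s) can be simulated with two nested loops: an outer loop over the uncertainty in the parameters and an inner loop over the variability given those parameters. The Python sketch below is a minimal illustration. The uncertainty distributions used for m and s, Normal(100, 2) and Uniform(8, 12), are hypothetical choices and not values taken from this text.

```python
import random
from statistics import mean


def second_order_samples(n_outer, n_inner, seed=0):
    """Simulate Normal(m, s) where m and s are themselves uncertain.

    Returns one (m, s, values) tuple per outer iteration so that the
    uncertainty and variability dimensions stay separate.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_outer):
        m = rng.gauss(100.0, 2.0)    # hypothetical uncertainty about the mean
        s = rng.uniform(8.0, 12.0)   # hypothetical uncertainty about the sd
        values = [rng.gauss(m, s) for _ in range(n_inner)]
        results.append((m, s, values))
    return results


for m, s, values in second_order_samples(n_outer=5, n_inner=2000):
    print(f"m={m:7.2f}  s={s:5.2f}  sample mean={mean(values):7.2f}")
```

Each row is one candidate first-order distribution. Pooling all the values into a single list would discard the distinction between uncertainty and variability.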

A plot of a first-order distribution is easy to understand. For example:

It is somewhat more difficult to illustrate a second-order distribution. One needs to account for the uncertainty about the distribution, which is usually done by using a number of lines to reflect possible true distributions (sometimes called candy-floss or spaghetti plots):  

The second-order cumulative plot is generally much clearer than its corresponding density plot.
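The family of lines in a second-order cumulative plot can be produced by evaluating one cumulative curve per candidate parameter set. The Python sketch below computes the curves without drawing them, so that it needs only the standard library. The uncertainty distributions for the mean and standard deviation are hypothetical, as is the function name candidate_cdfs.

```python
import random
from statistics import NormalDist


def candidate_cdfs(n_curves, grid, seed=0):
    """Return one list of cumulative probabilities per candidate Normal.

    Each candidate represents one possible true distribution, given
    hypothetical uncertainty about its mean and standard deviation.
    """
    rng = random.Random(seed)
    curves = []
    for _ in range(n_curves):
        m = rng.gauss(100.0, 2.0)    # hypothetical uncertainty about the mean
        s = rng.uniform(8.0, 12.0)   # hypothetical uncertainty about the sd
        dist = NormalDist(m, s)
        curves.append([dist.cdf(x) for x in grid])
    return curves


grid = list(range(60, 141, 10))
for curve in candidate_cdfs(n_curves=3, grid=grid):
    print(" ".join(f"{p:.3f}" for p in curve))
```

Plotting each returned list against the grid with any charting tool gives the spaghetti plot described above. The spread between the lines at a given value shows the uncertainty about the cumulative probability at that point.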

Table of distributions

The table below gives an overview of the various distributions available in ModelRisk, so that you can most easily focus on which ones might be most appropriate for your modeling needs. Follow the links for an in-depth explanation of each. We have used the most common name for each distribution. If you are interested in a particular distribution whose name does not appear here, try using the search facility because many distributions have several names, or are recognized as simply special cases of other, more common distributions.

Univariate Distributions

Unbounded

  Continuous: Cauchy, Error function, Error, Extreme Value Max, Extreme Value Min, Generalized Extreme Value, GLogistic, Hyperbolic Secant, JohnsonU, KernelCU, Laplace, Logistic, Mixed Normal, Normal, Skew Normal, Slash, Student-t, Student3

  Discrete: Skellam

Left or right bounded

  Continuous: Bradford, Burr, Chi, Chi Squared, Dagum, Erlang, Exponential, F, Fatigue Life, Gamma, Inverse Gaussian, Lifetime2, Lifetime3, LifetimeExp, Levy, LogGamma, LogLaplace, LogLogistic, Lognormal, Lognormal (base B), Lognormal (base E), Maxwell, NCChiSq, NCF, Pareto (1st kind), Pareto (2nd kind), Pearson 5, Pearson 6, Rayleigh, Weibull, Weibull3

  Discrete: Beta-geometric, Beta-Negative Binomial, Burnt Finger Poisson, Delaporte, Geometric, HypergeoM, Inverse Hypergeometric, Logarithmic, Negative Binomial, Poisson, Poisson Uniform, Polya

Left and right bounded

  Continuous: Beta, BetaSubj, Beta4, Bradford, Cumulative ascending, Cumulative descending, GTU, Histogram, JohnsonB, Kumaraswamy, Kumaraswamy4, LogTriangle, LogUniform, Modified PERT, Ogive, PERT, PERTAlt, Reciprocal, Relative, Split Triangle, Triangle, TriangleAlt, Uniform

  Discrete: Bernoulli, Beta-binomial, Binomial, Discrete, Discrete Uniform, Hypergeometric, HypergeoM, HypergeoD, InvHyperGeo, Step Uniform

Italics indicate non-parametric distributions

Multivariate Distributions

Unbounded

  Continuous: Multivariate Normal

  Discrete: none

Left bounded

  Continuous: none

  Discrete: Negative Multinomial 1, Negative Multinomial 2

Left and right bounded

  Continuous: Dirichlet

  Discrete: Multinomial, Multivariate Hypergeometric, Multivariate Inv Hypergeometric 1, Multivariate Inv Hypergeometric 2

 
