 Constructing an empirical distribution from data | Vose Software

# Constructing an empirical distribution from data

#### Situation

You have a set of random and representative observations of a single model variable, for example the number of children in American families (we'll look at a joint distribution for two or more variables at the end of this section), and you have enough observations to feel that the range and approximate random pattern has been captured. You want to use the data to construct a distribution directly, rather than fit a parametric distribution to it.

#### Technique

It is unnecessary to fit a distribution to the data: instead one can simply use the empirical distribution of the data (if there are no physical or biological reasons a certain distribution should be used, we generally prefer an empirical distribution). Below, we outline three options you have to use this data to construct an empirical distribution:

1. DUniform: uses only the list of observed values

2. CumulA and CumulD or more simply an Ogive: creates a cumulative distribution, and therefore allows values between those observed, and values beyond the observed range;

3. Histogram: when you have huge amounts of data

Option 1: A Discrete Uniform distribution

ModelRisk offers a Discrete Uniform distribution that takes one parameter: a list of values. It then randomly picks any one of those values with equal probability (sampling with replacement). Thus, for example, =VoseDUniform({1,4,5,7,10}) will generate, with each iteration, one of the five values 1, 4, 5, 7 or 10 (each value has during each iteration a probably of being picked of 20%). The figure below shows what the probability distribution looks like. Let's imagine that we have our data in an array of Cells called 'Observations'. By simply writing =VoseDUniform(Observations) we will generate a distribution that replicates the pattern of the observed data. You can use the DUniform distribution for both discrete and continuous data providing you have sufficient observations.

Option 2: A Cumulative distribution

If your data are continuous you also have the option of using a Cumulative distribution. This is a distribution that ModelRisk offers that takes four parameters: a minimum, a maximum, a list of values, and a list of cumulative probabilities associated with those values. From these parameters, it then constructs an empirical cumulative distribution by straight-line interpolation between the points defined on the curve. In ModelRisk, there are two forms of the Cumulative distribution: the Cumulative Ascending and the Cumulative Descending distribution.

Our best guess of the cumulative probability of a data point in a set of observations turns out to be r/(n+1) where r is the rank of the data point within the data set and n is the number of observations. Thus, when choosing this option, one needs to:

• Rank the observations in ascending or descending order (Excel has an icon that makes this simple);
• In a neighbouring column, calculate the rank of the data: write a column of values 1, 2, ... n;
• In the next column, calculate the cumulative probability F(x) = rank/(n+1);
• Use the data and F(x) columns as inputs to the VoseCumulA or VoseCumulD distribution, together with subjective estimates of what the minimum and maximum values might be.

An easier option is to use the Ogive distribution from ModelRisk, which performs these calculations automatically.

Note that the minimum and maximum values only have any effect on the very first and last interpolating lines to create the Cumulative distribution, and so the distribution is less and less sensitive to the values chosen as more data are used in its construction. As the min and max values are chosen subjectively it is often sensible to use a subjective distribution (e.g. a PERT) for both, too.

Option 3: A histogram distribution

Sometimes (admittedly, not as often as we'd like) we have enormous amounts of random observations that we would like to construct a distribution from (for example, the generated values from another simulation). The DUniform and Cumul options above start to get a bit slow at that point, and model the variable in unnecessarily fine details. A more practical approach now is to create a histogram of the data and use that instead. The array function FREQUENCY( ) in Excel will analyse a data set and say how many lie with any number of contiguous bin ranges. The ModelRisk distribution VoseHistogram has three parameters: the minimum possible value, the maximum possible value, and an array of bin frequencies (or probabilities), which is just the FREQUENCY( ) array.

#### Creating an empirical joint distribution for two or more variables

For data that are collected in sets (pairs, triplets, etc), there may be correlation patterns inherent in the observations, and that we would like to maintain whilst fitting empirical distributions to data. An example is data of people's weight and  height, where there is clearly some relationship between them. A combination of using a StepUniform distribution (with min = 1 and max = number of rows) with an Excel VLOOKUP() or OFFSET( ) function allows us to do this easily.