| Download a pdf copy of this help file here |
See also: Most common mistakes introduction
At school we learn things like:
1 + 3 = 4 so 4 - 1 = 3
3 * 2 = 6 so 6 / 3 = 2
Later, when we take algebra, we learn:
A + B = C therefore C - A = B
D*E = F therefore F / D = E
The problem is that these trusted rules do not apply so universally when manipulating random variables. This section explains how and when these simple algebraic rules no longer work, shows you how to identify them in your model, and how to make the appropriate corrections.
Most deterministic spreadsheet models consist of linked formulae that contain nothing more complicated than simple operations like +, -, * and /. When we decide to start adding uncertainty to the values of the components in the model, it seems natural enough to simply replace a fixed value by a probability distribution describing our uncertainty. So, for example, the simple model for a company offering some credit service:

could be 'risked' and replaced
by this
example model:

The best guess estimates of the money borrowed by a client and for the number of clients have been replaced by distributions, but the model is otherwise unchanged. This model is probably very wrong. Why? Well, the error is most easily seen by watching random values being generated on-screen. Look at the values that are being used for the entire client base and compare with where these values sit on the Lognormal(10000,4000) distribution.

For example,the Lognormal(10000,4000) distribution has 10% of its probability below 5,670. Thus, in 10% of it's iterations it will generate a value below this figure and that value will be used for all customers. The Lognormal distribution undoubtedly reflects the variability that is expected between customers (perhaps, for example, it was fit to a relevant data set of amounts individual customers have previously borrowed).
The probability that two randomly selected customers will borrow less that 5,670 is 10% * 10% = 1% (0.1 * 0.1 = 0.01 if you prefer). The probability that all (say) 6500 customers borrow less than 5,670 if the amounts they borrow are independent is 10-6500 i.e. effectively impossible, yet our model gives it a 10% probability.
In order to model this problem correctly we need to consider what are the sources of uncertainty about the amount a customer borrowed. If the source is specific to each individual client, then the amounts can be considered independent, and the VoseAggregateMC function provides the correct approach. If there is some systematic influence (like the state of the economy, recent bad press for companies offering credit, etc.) it will have to be separated out from the individual, independent component.
Say
we take the sum of two independent Uniform(0,1)
distributions, what do you think the resulting distribution would be?
The answer often surprises people. It is hard to imagine a simpler problem, yet when we canvass a class we get quite a range of answers.
Perhaps a Uniform(0,2)? That's the most common response. Or something looking a little Normal maybe? That is actually not such a bad guess.
The correct answer is a Triangle(0,1,2).
So we could write:
U(0,1) + U(0,1) = T(0,1,2)
The first message in this example is that it is difficult for a person not very well versed in risk analysis modeling to be able to predict well the results of even the most trivial model. Of course, that makes it very hard to check the model and be comfortable about its results.
On to the next question we often pose our class:
T(0,1,2) - U(0,1) = ?
Now wise to the trickiness of the question, most class participants are pretty sure that their first guess (i.e. = U(0,1) ) is wrong but don't have anything else to suggest. The answer is a symmetric distribution that looks a little Normal, stretching from -1 to 2 with peak at 0.5. But why isn't it U(0,1)? An easy way to visualize this is to run a simulation adding two Uniform(0,1) distributions and plotting the generated values from one Uniform distribution together with the calculated sum of them both. You get a scatter plot that looks like this:

Note the red line y=x shows the lowest value for the Triangle distribution for any given value of the Uniform distribution, and the green line y = 1+x is the highest value, which makes intuitive sense. The vertical spread between the two lines is the effect of the second Uniform(0,1) distribution.
Also note that all the generated values lie uniformly (but randomly) between these two lines. This actually is quite helpful in visualizing why the sum of two Uniform(0,1) distributions is a Triangle(0,1,2) by projecting all the dots onto the y-axis. Can you extend this graph to work out graphically what U(0,1) + U(0,3) would look like? (the answer is explained below)
The point of the graph is to show you that there is a strong dependency pattern between these two distributions (a Uniform and the Triangle sum), which would need to be taken into account if one wished to extract back out the two Uniform distributions from each other. For example, the formulae below do just that:
A: =VoseTriangle(0,1,2)
B: =VoseUniform(IF(A<1,0,A-1),IF(A>1,1,A))
C: =A - B
Try to follow the logic for the formula for B from the graph. B will generate a Uniform(0,1) distribution with the right dependency relationship with A to leave C a Uniform(0,1) distribution too.
To recap, the problem is that we have three variables linked together as follows:
A + B = C
We know the distributions for A and C, how do we find B and how do we simulate A, B and C all together? The simple example above using two Uniform distributions allows us to simulate A, B and C all together, but only because we assumed A and B were independent, and the problem was very simple. In general, we cannot correctly determine B, so we need either to construct a model that avoids having to perform such a calculation, or admit that we have insufficient information to specify B.
Add Uniform(0,1) and Uniform(0,3) distributions together. Plot the Uniform(0,3) generated values with the calculated sum, to get the graph below.

Note that the scatter-plot has equal density between the red (y = x) and green (y = x+1) lines. In the y-range between 0 and 1 the scatter-point area is increasing linearly, then between y = 1 and y = 3 it is constant, and from y = 3 to y = 4 the scatter-point area decreases linearly to zero. Thus the sum takes a trapezoid shape:
