
Aggregate distributions introduction

See also: Aggregate modeling in ModelRisk

Introduction

We are very frequently interested in the total of a number of independent, identically distributed random variables (often abbreviated to iids in statistics). By 'independent' we mean that each random variable takes a value that is not influenced by the value of any of the other random variables. Some general examples are:

  • The total purchases by n customers where we know the probability distribution of the purchase amount from a random customer.

  • The amount of water drunk by the citizens of a town of size n where we know the probability distribution of the amount of water drunk by a random citizen.

  • The price of yearly maintenance of a fleet of n similar vehicles where we know the probability distribution of the yearly maintenance cost of a random, similar vehicle.

Insurance and finance risk analysis in particular frequently requires the determination of the sum of random variables. The resulting distribution is called the aggregate distribution. For example:

  • The aggregate claim distribution for a portfolio of policies over a certain period.

  • The total claim distribution for an individual over a period.

  • The total exposure from a set of investments.

A frequent mistake in risk modeling is to take the probability distribution of the individual iid and multiply it by n. For example, if the purchase amount of a random customer is known to be $Lognormal(20,35), and we want to know the total purchase for 1000 customers, we might think to perform the following calculation:

Total purchase ($) = VoseLogNormal(20,35)*1000

The Lognormal(20,35) distribution has an 8.6% probability that a random purchase is above $50 and a 33.5% probability that it is below $6. The VoseLogNormal(20,35)*1000 formula will therefore generate a value above $50,000 with 8.6% probability, i.e. it allocates the same probability to all 1000 customers coming in and spending over $50 each as to just one customer doing so. The formula ignores the fact that some customers will make small purchases and others large ones, so the total will even out to something much closer to 1000 times the average purchase of $20.

The correct way to model the total customer purchases is to individually generate 1000 LogNormal(20,35) variables and add them all together.

In ModelRisk this is done with the VoseAggregateMC function. To add 1000 LogNormal(20,35) distributed variables together you write

=VoseAggregateMC(1000,VoseLogNormalObject(20,35))

For large n we can also use the Central Limit Theorem (CLT), which states that the sum will be approximately distributed Normal(1000*20, SQRT(1000)*35). So we write VoseNormal(1000*20,SQRT(1000)*35) or, more conveniently, VoseCLTSum(1000,20,35).

Note that you can use the VoseMean(DistributionObject) and VoseStDev(DistributionObject) functions for distributions where the mean and standard deviation (needed for CLT) are not as easily obtained as with the LogNormal.

The figure below plots the cumulative distributions of the incorrect [=VoseLogNormal(20,35)*1000] and correct [VoseAggregateMC and VoseCLTSum] ways of modeling this sum.

The results were obtained by running 10,000 iterations. The two methods have the same mean, but the correct method has a far narrower distribution. In fact, its standard deviation is SQRT(1000), or about 32, times smaller! The almost perfect overlay between the two correct methods also shows that the CLT approximation works extremely well when a large number of distributions (here 1000) are added together, even though those distributions (Lognormal(20,35)) are quite skewed.
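To see the size of the error outside of ModelRisk, the sketch below (a minimal illustration assuming Python with NumPy; the Lognormal's own mean and standard deviation, as used in the text, are first converted to the underlying Normal's parameters) compares the incorrect scaling approach, the correct sum of 1000 samples, and the CLT approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mean, sd = 1000, 20.0, 35.0
iterations = 10_000

# Convert the Lognormal's own mean/sd to the underlying Normal's parameters.
sigma2 = np.log(1 + (sd / mean) ** 2)
mu = np.log(mean) - sigma2 / 2

def purchase(size):
    """Sample random customer purchases from a Lognormal with mean 20 and sd 35."""
    return rng.lognormal(mu, np.sqrt(sigma2), size)

wrong = purchase(iterations) * n                          # incorrect: one purchase scaled by n
right = purchase((iterations, n)).sum(axis=1)             # correct: sum n independent purchases
clt = rng.normal(n * mean, np.sqrt(n) * sd, iterations)   # CLT: Normal(n*m, SQRT(n)*s)

for label, x in [("wrong", wrong), ("sum", right), ("CLT", clt)]:
    print(f"{label:>5}: mean = {x.mean():8.0f}   sd = {x.std():8.0f}")
# All three have a mean of about 20,000, but the incorrect method's sd is
# about 35,000 versus about 1,107 (= SQRT(1000)*35) for the other two.
```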

One of the most common mistakes people make in producing even the most simple Monte Carlo simulation model is in calculating sums of random variables.

The techniques explained here have extremely broad use in risk analysis for estimating the sum of random variables. Since they are of great importance in insurance and finance, the examples and applications outlined here will often focus on those fields. However, the importance of aggregate modeling extends to nearly every field of risk analysis.

Aggregate modeling is one of the strong points of ModelRisk, so once you have a good understanding of the subject, have a look at how easy aggregate modeling is with ModelRisk - you might be surprised.

A closer look: the 6 situations of summing random variables

Essentially, we have six situations to deal with:

  • Situation A: n is a fixed value; X is a fixed value.

  • Situation B: n is a fixed value; X is a random variable, and all n take the same value.

  • Situation C: n is a fixed value; X is a random variable, and all n take different values (iids).

  • Situation D: n is a random variable; X is a fixed value.

  • Situation E: n is a random variable; X is a random variable, and all n take the same value.

  • Situation F: n is a random variable; X is a random variable, and all n take different values (iids).

Now let us look more closely at how to approach each of these situations.

Situations A, B, D and E

For situations A, B, D and E the sum is very easy to calculate:

SUM = n * X
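For instance, situation E (a random count n where every one of the n items takes the same random value of X) is just the product of two samples; a minimal sketch, assuming Python with NumPy and an illustrative Poisson count with a shared Lognormal value:

```python
import numpy as np

rng = np.random.default_rng(2)

# Situation E: n is random and X is random, but all n items take the
# *same* sampled value of X, so the total is simply n * X.
n = rng.poisson(100)                      # random count
x = rng.lognormal(mean=3.0, sigma=0.5)    # one shared value of X
total = n * x
print(n, x, total)
```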

Situation C

When the X are independent random variables (i.e. each X being summed can take a different value) and n is fixed, we often have a simple way to determine the aggregate distribution based on known identities. The most common identities are listed in the following table:

Common aggregate identities (distribution of X → aggregate distribution of the sum of n X's):

  • Bernoulli(p) → Binomial(n,p)

  • BetaBinomial(m,a,b) → BetaBinomial(n*m,a,b)

  • Binomial(m,p) → Binomial(n*m,p)

  • Cauchy(a,b) → n*Cauchy(a,b)

  • ChiSquared(v) → ChiSquared(n*v)

  • Erlang(m,b) → Erlang(n*m,b)

  • Exponential(b) → Gamma(n,b)

  • Gamma(a,b) → Gamma(n*a,b)

  • Geometric(p) → NegBin(n,p)

  • Levy(c,a) → n²*Levy(c,a)

  • NegBin(s,p) → NegBin(n*s,p)

  • Normal(m,s) → Normal(n*m,SQRT(n)*s)

  • Poisson(l) → Poisson(n*l)

  • Student(v) → Student(n*v)
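These identities are easy to check by simulation. For example, the sum of n Exponential(b) variables is a Gamma(n,b); a minimal sketch of that check (assuming Python with NumPy, reading b as the scale parameter):

```python
import numpy as np

rng = np.random.default_rng(3)
n, b = 50, 6.0            # number of variables summed, Exponential scale
iterations = 100_000

# Brute force: sum n independent Exponential(b) samples in each iteration.
summed = rng.exponential(b, size=(iterations, n)).sum(axis=1)

# Identity: sample the same aggregate directly from Gamma(shape=n, scale=b).
direct = rng.gamma(n, b, size=iterations)

print(summed.mean(), summed.std())   # about 300 and 42.4 (n*b and SQRT(n)*b)
print(direct.mean(), direct.std())   # matches to within simulation error
```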

We also know from the Central Limit Theorem that if n is large enough the sum will often look like a Normal distribution. If X has a mean m and standard deviation s, then as n becomes large we get:

Sum of n X's ≈ Normal(n*m, SQRT(n)*s)

which is rather nice because it means we can take a distribution like the Relative and determine its moments (the VoseMoments function from ModelRisk will do this automatically for you), or just use the mean and standard deviation of relevant observations of X. It also explains why the distributions in the right-hand column of the above table often look approximately Normal.

When none of these identities apply we have to simulate a column of X variables of length n and add them up, which is usually not too onerous in computing time or spreadsheet size because if n is large we can usually use the Central Limit Theorem approximation instead.

An alternative for situation C available in ModelRisk is to use the VoseAggregateMC function: for example, if we write:

=VoseAggregateMC(1000, VoseLognormalObject(2,6))

the function will generate and add together 1000 independent random samples from a Lognormal(2,6) distribution. However, if we wrote:

=VoseAggregateMC(1000, VoseGammaObject(2,6))

the function would generate a single value from a Gamma(2*1000,6) distribution because all of the identities in the table are programmed into the function.
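As a rough illustration of that behavior (a hypothetical sketch only, not ModelRisk's actual implementation, assuming Python with NumPy and, for the Lognormal, the underlying Normal's mu and sigma rather than the mean/standard deviation parameterization used in the text), such a function can use a closed-form identity where one is known and fall back to brute-force summation otherwise:

```python
import numpy as np

rng = np.random.default_rng(4)

def aggregate_mc(n, dist, *params):
    """Hypothetical sketch: sample the sum of n iid variables from `dist`,
    using a closed-form identity where one is known."""
    if dist == "gamma":              # sum of n Gamma(a,b) is Gamma(n*a,b)
        a, b = params
        return rng.gamma(n * a, b)
    if dist == "exponential":        # sum of n Exponential(b) is Gamma(n,b)
        (b,) = params
        return rng.gamma(n, b)
    if dist == "lognormal":          # no identity: generate and sum n samples
        mu, sigma = params
        return rng.lognormal(mu, sigma, size=n).sum()
    raise ValueError(f"unsupported distribution: {dist}")

print(aggregate_mc(1000, "gamma", 2, 6))        # one sample from Gamma(2000, 6)
print(aggregate_mc(1000, "lognormal", 2, 0.5))  # sum of 1000 Lognormal samples
```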

Situation F

This leaves us with situation F - the sum of a random number of random variables.


[Figure: the Beta4(3,7,0,8) distribution]

We have a couple of options based on the techniques described above for situation C. If we are adding together X variables shown in the common aggregate identities table, then we can apply those identities by simulating n in one cell and linking that to a cell that simulates from the aggregate variable conditioned on n.

For example, imagine we are summing Poisson(100) X variables where each X variable takes a Gamma(2,6) distribution. Then we can write the following:

Cell A1:

=VosePoisson(100)

Cell A2 (output):        

=VoseAggregateMC(A1,VoseGammaObject(2,6))
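The same two-step logic can be sketched outside the spreadsheet (assuming Python with NumPy and reading Gamma(2,6) as shape 2, scale 6); thanks to the Gamma identity only a single Gamma sample is needed per iteration:

```python
import numpy as np

rng = np.random.default_rng(5)

n = rng.poisson(100)            # cell A1: random number of variables to sum
total = rng.gamma(2 * n, 6)     # cell A2: sum of n Gamma(2,6) = Gamma(2*n, 6)
print(n, total)                 # total has mean around 100 * 2 * 6 = 1200
```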

We can also use the Central Limit Theorem method.

Imagine we have n = Poisson(1000) and X = Beta4(3,7,0,8), which looks like the figure above.

The distribution is not terribly asymmetric, so adding roughly 1000 of them will look very close to a Normal distribution, which means that we can be confident in applying the Central Limit Theorem approximation, shown in the example model below.

 

Example model Aggregate_CLT for the Central Limit Theorem approximation

Here we have made use of the VoseMoments array function which returns the moments of a distribution object.

The VoseCLTSum function performs the same calculation as that shown in cell F5 but is a little more intuitive. The VoseAggregateMC alternative will, in this iteration, add together 999 values drawn from the Beta4 distribution because there is no known identity for sums of Beta4 distributions.
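The CLT route for this example can likewise be sketched outside ModelRisk (assuming Python with NumPy; Beta4(3,7,0,8) is read as a Beta(3,7) rescaled to the range [0,8], and its mean and standard deviation are computed directly instead of with VoseMoments):

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, lo, hi = 3, 7, 0.0, 8.0   # Beta4(3,7,0,8)

# Mean and standard deviation of the Beta4, scaled up from a standard Beta(3,7).
m = lo + (hi - lo) * a / (a + b)                                # = 2.4
s = (hi - lo) * np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))   # about 1.105

iterations = 10_000
n = rng.poisson(1000, size=iterations)      # random number of variables to sum
clt = rng.normal(n * m, np.sqrt(n) * s)     # CLT approximation of each sum

# Brute-force check for a single iteration: sum n[0] actual Beta4 samples.
brute = (lo + (hi - lo) * rng.beta(a, b, size=n[0])).sum()
print(clt.mean(), clt.std(), brute)         # means are around 1000 * 2.4 = 2400
```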

Methods for constructing aggregate distributions

There exist a range of very neat techniques for constructing the aggregate distribution when n is a random variable and X are independent identically distributed random variables. There are a lot of advantages to being able to construct such an aggregate distribution, among which are:

  • We can determine tail probabilities to a high precision.

  • It is much faster than Monte Carlo simulation.

  • We can manipulate the aggregate distribution as with any other in Monte Carlo simulation, e.g. correlate it with other variables.

The main disadvantage of these methods is that they are computationally intensive and need to run calculations through often very long arrays. All methods are implemented in ModelRisk, however, which runs the calculations internally and is optimized for speed.

Two methods of first interest are the Panjer recursive method and the Fast Fourier Transform (FFT) method. These two have a similar feel to them, and similar applications, though their mathematics is quite different. Then we'll look at a multivariate FFT method which allows us to extend the aggregate calculation to a set of {n,X} variables. The de Pril recursive method is similar to Panjer's and has a more specific use.
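To give a flavour of these methods before reading further, below is a minimal sketch of the Panjer recursion for a compound Poisson aggregate (assuming Python with NumPy and a claim-size distribution already discretized onto integer monetary units); ModelRisk performs this kind of calculation internally and far more efficiently:

```python
import numpy as np

def panjer_poisson(lam, severity, s_max):
    """Minimal sketch of the Panjer recursion for a compound Poisson sum.

    severity[j] is the probability that a single claim equals j (j = 0, 1, 2, ...).
    Returns g, where g[s] = P(aggregate claims = s) for s = 0..s_max.
    """
    f = np.zeros(s_max + 1)
    f[: len(severity)] = severity[: s_max + 1]

    g = np.zeros(s_max + 1)
    g[0] = np.exp(lam * (f[0] - 1.0))        # P(aggregate = 0)
    for s in range(1, s_max + 1):
        j = np.arange(1, s + 1)
        g[s] = (lam / s) * np.sum(j * f[j] * g[s - j])
    return g

# Example: Poisson(3) claims, each equal to 1, 2 or 3 with equal probability.
g = panjer_poisson(3.0, np.array([0.0, 1/3, 1/3, 1/3]), s_max=40)
print(g.sum())                      # about 1 (the tail beyond 40 is negligible)
print((np.arange(41) * g).sum())    # mean about 3 * 2 = 6
```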

A quick and useful way to check whether an approximated aggregate distribution is accurate is to compare its moments with the exact theoretical moments of the aggregate distribution, since these can be calculated directly. Therefore it is useful to have a closer look at these first.
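For instance, when n and the X's are independent, the exact mean and variance of the aggregate S are E[S] = E[n]*E[X] and Var[S] = E[n]*Var[X] + Var[n]*E[X]^2. A minimal sketch checking these against simulation for the Poisson(100), Gamma(2,6) example above (assuming Python with NumPy, Gamma read as shape/scale):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, shape, scale = 100.0, 2.0, 6.0

ex, varx = shape * scale, shape * scale ** 2     # E[X] = 12, Var[X] = 72 for Gamma(2,6)

# Exact aggregate moments: E[S] = E[n]E[X], Var[S] = E[n]Var[X] + Var[n]E[X]^2.
mean_exact = lam * ex                            # = 1200
var_exact = lam * varx + lam * ex ** 2           # Var[n] = E[n] = lam for a Poisson count

# Simulated aggregate, using the Gamma identity within each iteration.
n = rng.poisson(lam, size=20_000)
sim = rng.gamma(shape * n, scale)

print(mean_exact, sim.mean())                    # 1200 vs simulated value
print(np.sqrt(var_exact), sim.std())             # about 147 vs simulated value
```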

Read on: Moments of an aggregate distribution

 

 
