Multivariate trials

We sometimes need to recognise the inter-relationship between probabilities of values for two or more distributions.

In other words, these distributions are not independent of each other.

Some ModelRisk features and other modelling methods allow us to crudely model correlations between several distributions. However, there are certain situations where specific multivariate distributions are needed.

The following three common multivariate distributions are described here:

Multinomial
Dirichlet
and Multivariate Hypergeometric

Multinomial distribution

For a set of n trials, each of which could take one of k different outcomes (like different colours of balls in a huge urn) with probabilities p1..pk, the distribution of the outcomes is known as multinomial, which is just an extension of the binomial distribution. The only difference is the number of possible outcomes: only two for the binomial and multiple for the multinomial. The Multinomial distribution has the following probability mass function:

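$$P(s_1, \ldots, s_k) = \frac{n!}{s_1! \, s_2! \cdots s_k!} \; p_1^{s_1} p_2^{s_2} \cdots p_k^{s_k}$$

where s_1 + s_2 + ... + s_k = n. The factor n! / (s_1! s_2! ... s_k!) is sometimes known as the multinomial coefficient.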
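As a quick numerical check, the multinomial probability (the multinomial coefficient multiplied by the product of each p_i raised to the power s_i) can be evaluated directly. The Python sketch below is only an illustration; the function name `multinomial_pmf` is hypothetical and is not a ModelRisk function.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of seeing exactly `counts` outcomes of each type
    in sum(counts) independent trials with outcome probabilities `probs`."""
    n = sum(counts)
    # Multinomial coefficient n! / (s1! * s2! * ... * sk!), kept as an exact integer.
    coeff = factorial(n)
    for s in counts:
        coeff //= factorial(s)
    return coeff * prod(p ** s for p, s in zip(probs, counts))

# With only two outcome types this is the ordinary binomial probability:
# 3 trials, 2 successes, p = 0.5  ->  3 * 0.5**3
print(multinomial_pmf([2, 1], [0.5, 0.5]))  # 0.375
```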

Let's consider the following problem: The cars in a city are divided into 9 different categories. We know the proportion of the city's cars that are in each category. If we were to monitor 1000 cars that enter a particular motorway, how many cars of each category would we see?

This is clearly a problem of multinomial trials since every car that enters the motorway can be any one of 9 types.

To sample from a multinomial distribution we need to proceed as follows:

We know p1, p2,...,pk (proportions of each type) and n (our sample size - 1000).

First we simulate from binomial(n, p1) - this gives us s1.

For each remaining category, we simulate s2, s3, ..., sk in order, with sj = binomial(n - SUM(s1...sj-1), pj / SUM(pj...pk)).

Note that the marginal distribution for sj (i.e. the distribution of generated values for sj when looked at by itself) is simply a binomial(n,pj).

So, our model looks like this:

Our first category is simulated in cell C11, which is just a binomial distribution: VoseBinomial(1000, 5%).

Because the second category must take the result for the first category into account, the formula in cell D11 becomes as shown above: the number of trials is reduced by the number of successes from the first category, and the probability of success becomes the probability of category two divided by the sum of the probabilities of the remaining 8 categories.

This logic is consistent throughout the "Successes" row (cells D11:K11), and the row "Outputs" shows a nice way of naming the output cells.
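The sequential binomial scheme can also be sketched outside a spreadsheet. The Python below is a minimal illustration using only the standard library: the helper names are hypothetical, the binomial draw is done by simply counting Bernoulli successes (adequate for n = 1000), and apart from the 5% for the first category the proportions are made-up values chosen to sum to 1.

```python
import random

def binomial(n, p, rng):
    """Draw from Binomial(n, p) by counting successes in n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

def sample_multinomial(n, probs, rng=random):
    """One multinomial sample, built from a chain of conditional binomials."""
    counts = []
    remaining_n = n            # trials not yet allocated to a category
    remaining_p = sum(probs)   # total probability of the categories still to come
    for p in probs[:-1]:
        # Probability of this category given that earlier categories are excluded.
        cond_p = min(1.0, p / remaining_p) if remaining_p > 0 else 0.0
        s = binomial(remaining_n, cond_p, rng)
        counts.append(s)
        remaining_n -= s
        remaining_p -= p
    # For the last category the conditional probability is 1,
    # so it simply receives whatever trials are left.
    counts.append(remaining_n)
    return counts

# Hypothetical proportions for 9 car categories (only the first, 5%, is from the text).
proportions = [0.05, 0.10, 0.15, 0.20, 0.10, 0.10, 0.10, 0.15, 0.05]
print(sample_multinomial(1000, proportions, random.Random(1)))
```

Each call returns nine counts that always add up to 1000, which is the dependency between the categories that independent binomial draws would not respect.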

Dirichlet distribution

The conjugate to the multinomial distribution is the Dirichlet distribution, much like the beta distribution is the conjugate to the binomial distribution. The Dirichlet distribution is used for modelling the uncertainty around probabilities of successes in multinomial trials.

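The Dirichlet distribution has the following probability density function:

$$f(p_1, \ldots, p_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} p_i^{\alpha_i - 1}$$

where each p_i lies between 0 and 1, p_1 + ... + p_k = 1, and each α_i > 0.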

For example, if you've observed s1, s2, ..., sk outcomes of the different types from n trials, the Dirichlet distribution provides the confidence distribution about the correct values for the probability that a random trial will produce each type of outcome by setting α1 = s1 + 1, α2 = s2 + 1, ..., αk = sk + 1. These probabilities have to sum to 1, so their uncertainty distributions are inter-related.

Let's take the same problem that we used in the previous example: All cars in a city are divided into 9 different types. But now we have monitored 1000 cars that were entering a particular motorway, and counted the number of cars of each type. What is the uncertainty distribution for the proportions of each type in the total population of cars?

Putting the above logic into a spreadsheet model looks like this:

The formulas in cells D10 to J10 follow the same logic as C10, except that the result is then multiplied by (1 - sum of the previous cells in the same row). The last cell, K10, calculates the implied probability for the last category as 1-SUM(C10:J10).
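A Dirichlet sample can be built from a chain of beta distributions in the way this row of cells is described. The Python sketch below assumes the usual construction, in which the j-th draw is Beta(αj, sum of the later α values) scaled by the probability not yet allocated; the exact cell formulas of the spreadsheet are not reproduced here, and both the function name and the observed counts are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng=random):
    """One Dirichlet(alphas) sample built from successive beta draws."""
    probs = []
    remaining_alpha = sum(alphas)  # sum of the alphas not yet used
    remaining_mass = 1.0           # 1 minus the probabilities already assigned
    for a in alphas[:-1]:
        remaining_alpha -= a
        # Share of the still-unallocated probability taken by this category.
        fraction = rng.betavariate(a, remaining_alpha)
        p = fraction * remaining_mass
        probs.append(p)
        remaining_mass -= p
    # The last probability is implied: 1 minus the sum of the others.
    probs.append(remaining_mass)
    return probs

# Hypothetical counts of each of the 9 car types among 1000 observed cars.
observed = [48, 102, 151, 199, 97, 105, 98, 148, 52]
alphas = [s + 1 for s in observed]  # alpha_i = s_i + 1, as in the text
print(sample_dirichlet(alphas, random.Random(1)))
```

Every sample is a full set of nine proportions summing to 1, so repeated calls trace out the joint uncertainty about the proportions rather than nine unrelated distributions.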

The Dirichlet distribution is not as intuitive as the Multinomial distribution, but it is a very handy tool when modelling multinomial trials.

Multivariate hypergeometric

Sometimes we need to model sampling without replacement from a population with multiple possible outcomes. When the population is small, the process cannot be approximated by a multinomial, because the probabilities of success do not remain constant from one draw to the next. In this case we use the multivariate hypergeometric distribution, which is similar to the hypergeometric distribution and differs only in the number of possible outcomes from a trial (two in the hypergeometric, many in the multivariate hypergeometric).

The figure below shows the graphical representation of the multivariate hypergeometric process. D1, D2, D3 and so on are the numbers of individuals of each type in the population, and x1, x2, x3, ... are the numbers of successes (the number of individuals in our random sample (circled) belonging to each category).

The Multivariate hypergeometric distribution has the following probability mass function:

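$$P(x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} \binom{D_i}{x_i}}{\binom{M}{s}}$$

where M = D_1 + ... + D_n is the population size and s = x_1 + ... + x_n is the sample size.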
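The multivariate hypergeometric probability is the product of the binomial coefficients C(Di, xi) divided by C(M, s), where M is the total population and s the total sample size. The short Python sketch below evaluates this for a deliberately tiny, made-up population; the function name is hypothetical.

```python
from math import comb, prod

def multivariate_hypergeometric_pmf(counts, group_sizes):
    """Probability of drawing exactly `counts` individuals of each type
    when sum(counts) are sampled without replacement."""
    s = sum(counts)        # sample size
    M = sum(group_sizes)   # population size
    ways = prod(comb(D, x) for D, x in zip(group_sizes, counts))
    return ways / comb(M, s)

# Tiny hypothetical population: 2 individuals of type A and 1 of type B; draw 2.
# P(one of each) = C(2,1) * C(1,1) / C(3,2) = 2/3
print(multivariate_hypergeometric_pmf([1, 1], [2, 1]))
```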

Let's imagine a problem where we have 100 coloured balls in a bag, of which 10 are red, 15 purple, 20 blue, 25 green and 30 yellow. Without looking into the bag, you take 30 balls out. How many balls of each colour will you take from the bag?

We cannot model this problem using the multinomial distribution, because when we take the first ball out, the proportions of the different colours of balls in the bag change. The same happens when we take the second ball out, and so on.

Thus, we must proceed as follows:

Model the first colour (red, for example) as x1 = Hypergeometric(s, D1, M), where s is the sample size (30), D1 is the total number of red balls in the bag (10), and M is the population size (100).
Model the rest as xi = Hypergeometric(s - SUM(x1:xi-1), Di, SUM(Di:Dn)), where xi is the number of successes of type i in the sample, Di is the number of individuals of type i in the total population, and Dn is the number of individuals of the last type in the total population.
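The two steps above can be sketched in Python for the coloured-ball problem. This is only an illustration using the standard library: the function names are hypothetical, and each hypergeometric draw is simulated naively by sampling item indices without replacement rather than by calling a ModelRisk function.

```python
import random

def hypergeometric(n, D, M, rng):
    """Number of marked items among n drawn without replacement
    from a population of M items, D of which are marked."""
    # Treat indices 0..D-1 as the marked items.
    return sum(i < D for i in rng.sample(range(M), n))

def sample_multivariate_hypergeometric(sample_size, group_sizes, rng=random):
    """One multivariate hypergeometric sample via a chain of hypergeometrics."""
    counts = []
    remaining_sample = sample_size           # s - SUM(x1:xi-1)
    remaining_population = sum(group_sizes)  # SUM(Di:Dn)
    for D in group_sizes[:-1]:
        x = hypergeometric(remaining_sample, D, remaining_population, rng)
        counts.append(x)
        remaining_sample -= x
        remaining_population -= D
    # Only the last type is left, so it accounts for the rest of the sample.
    counts.append(remaining_sample)
    return counts

# Red, purple, blue, green, yellow balls in the bag; 30 balls are taken out.
balls = [10, 15, 20, 25, 30]
print(sample_multivariate_hypergeometric(30, balls, random.Random(1)))
```

The counts always add up to 30 and never exceed the number of balls of that colour in the bag, which is exactly what a multinomial approximation cannot guarantee for a small population.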
