Correlation with project schedule risk analysis

What is correlation?

Correlation is the degree to which the observed values of two or more random variables are related. If two random variables are uncorrelated, the observed value of one gives no indication of the value of the other. If one variable is likely to take a high (low) value when the other variable takes a high (low) value, the two variables are positively correlated. If the opposite occurs (a high value for one variable is likely to occur at the same time as a low value for the other variable), the two variables are negatively correlated.

Why is correlation important?

Correlation between task durations most often increases the spread in the distribution of how long it will take to reach a milestone or finish. Thus, ignoring correlation effects will generally underestimate the risk of project overrun, perhaps to a very large degree.

Correlation coefficient

The degree of correlation is usually measured by a statistic known as the correlation coefficient, which varies between -1 and +1 so that:

A correlation of -1 means that if the observation from one random variable is at the P^th percentile, the observation of the other will be at exactly the (100-P)^th percentile;
A correlation between -1 and 0 means that if the observation from one random variable is at the P^th percentile, the observation of the other will most likely be around the (100-P)^th percentile, but could still lie anywhere between the 0^th and 100^th;
A correlation of 0 means that if the observation from one random variable is at the P^th percentile, the observation of the other will be anywhere between the 0^th and 100^th percentile with equal probability;
A correlation between 0 and 1 means that if the observation from one random variable is at the P^th percentile, the observation of the other will most likely be around the P^th percentile, but could still lie anywhere between the 0^th and 100^th;
A correlation of +1 means that if the observation from one random variable is at the P^th percentile, the observation of the other will also be at exactly the P^th percentile.
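
The percentile descriptions above read most naturally as a rank-based (Spearman) correlation coefficient; that interpretation is an assumption, since the text does not name the statistic. Under that assumption, a minimal Python sketch using NumPy shows the three regimes. The helper name rank_correlation and the sample data are illustrative only.

```python
import numpy as np

def rank_correlation(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks.

    Assumes continuous data with no tied values.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]

rng = np.random.default_rng(seed=1)
x = rng.normal(size=10_000)
noise = rng.normal(size=10_000)

# y always sits at the same percentile as x: coefficient of +1
same_order = rank_correlation(x, np.exp(x))
# y always sits at the (100-P)th percentile: coefficient of -1
reversed_order = rank_correlation(x, -x**3)
# y is generated independently of x: coefficient close to 0
unrelated = rank_correlation(x, noise)
# y shares part of its variation with x: coefficient between 0 and +1
partial = rank_correlation(x, x + noise)

print(same_order, reversed_order, unrelated, partial)
```

Note that the relationship does not have to be linear to reach +1 or -1 on this measure; it only has to preserve (or exactly reverse) the ordering of the observations.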

Older correlation methods and why they don’t work

In project schedules, the durations of tasks are often strongly correlated. Older schedule risk analysis software uses correlation coefficients to model these effects, and then uses simulation algorithms to produce the required level of correlation between random samples. This method has four critical shortcomings:

Correlation coefficients are mathematical concepts and have no natural interpretation in the real world, which means that correlation coefficient estimates provided by subject matter experts are very unreliable;
With no natural interpretation, the project manager gains no insight into why the correlation exists and how it can be managed;
Only certain combinations of correlation coefficients are possible. For example, if A and B are highly positively correlated, and C is highly negatively correlated with A, then C must also be highly negatively correlated with B. Specifying a consistent set of correlation levels becomes very difficult when there are several tasks;
The number of correlation coefficients required to correlate n different task durations is given by the equation n(n-1)/2. This number increases to impractical levels very quickly: 10 tasks require 45 coefficients, 20 tasks require 190, and 50 tasks require 1,225.
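
The last two shortcomings can be illustrated with a short Python sketch. The function names are hypothetical, and the consistency test used here (a correlation matrix must have no negative eigenvalues) is the standard mathematical condition, not a description of how any particular schedule risk tool validates its inputs.

```python
import numpy as np

def coefficient_count(n):
    """Number of pairwise correlation coefficients for n task durations."""
    return n * (n - 1) // 2

def is_valid_correlation_matrix(matrix, tol=1e-10):
    """A set of coefficients is achievable only if the symmetric matrix
    they form has no negative eigenvalues (positive semi-definite)."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(matrix, dtype=float))
    return bool(eigenvalues.min() >= -tol)

counts = {n: coefficient_count(n) for n in (5, 10, 20, 50, 100)}
print(counts)  # {5: 10, 10: 45, 20: 190, 50: 1225, 100: 4950}

# Tasks A, B, C. A-B = +0.9 and A-C = -0.9 in both cases.
# Claiming B-C = +0.9 contradicts the other two coefficients.
inconsistent = [[1.0, 0.9, -0.9],
                [0.9, 1.0, 0.9],
                [-0.9, 0.9, 1.0]]
# With B-C = -0.9 the three coefficients can coexist.
consistent = [[1.0, 0.9, -0.9],
              [0.9, 1.0, -0.9],
              [-0.9, -0.9, 1.0]]

print(is_valid_correlation_matrix(inconsistent))  # False
print(is_valid_correlation_matrix(consistent))    # True
```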

In practice, it becomes an unrealistic task to incorporate correlation for schedules with more than about 10 correlated tasks, which is far too small for most projects of course. The result is that analysts have been forced not to include correlation effects. Since this will tend to result in an unrealistically small estimate of schedule uncertainty, one could argue that this is more dangerous than not performing a schedule risk assessment at all.

Tamara’s approach to correlation

In the real world, correlation exists between tasks because of some shared influencing factor(s). Tamara asks the user to explore what those factors might be and to quantify them in an intuitive manner. No correlation coefficients are used.

In Tamara, there are two modelling capabilities that will naturally induce correlation between task durations. The first is specifying that a risk event can impact two or more tasks: if the risk event occurs, all tasks connected to it will be delayed, and if it doesn’t occur, none of them will.
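
As a rough illustration of the first mechanism, the Python sketch below adds a shared risk event to two otherwise independent task durations. All numbers (the base durations, the 30% probability and the 30-day delay) are hypothetical, and this is not Tamara's algorithm, only a plain Monte Carlo sketch of the idea.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 100_000

# Hypothetical base durations in days, sampled independently for each task
base_1 = rng.triangular(40, 50, 70, size=n)
base_2 = rng.triangular(40, 50, 70, size=n)

# Hypothetical risk event: 30% chance of occurring, delays each task by 30 days
risk_occurs = rng.random(n) < 0.3
delay_days = 30.0

# The same event outcome is applied to both tasks in every iteration
duration_1 = base_1 + risk_occurs * delay_days
duration_2 = base_2 + risk_occurs * delay_days

corr_without_risk = np.corrcoef(base_1, base_2)[0, 1]
corr_with_risk = np.corrcoef(duration_1, duration_2)[0, 1]
print(corr_without_risk, corr_with_risk)
```

The base durations are essentially uncorrelated, while the durations that include the shared event are strongly positively correlated, without any correlation coefficient having been specified.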

The second modelling capability is where productivity risk factors are common to two or more tasks. For example, imagine that we have a design project consisting of just two tasks for simplicity:

Task 1: Design building

Duration with scope uncertainty: min = 90 days, mode = 100 days, max = 130 days

Task 2: Design infrastructure

Duration with scope uncertainty: min = 80 days, mode = 100 days, max = 140 days

Now let’s add ‘Communication with client’ as a Productivity Risk Factor, meaning that if there is good communication it can speed up the design process by removing unnecessary work, and if the communication is bad it will slow it down. Both tasks are affected:

Communication with client

Effect on task duration: min = -10%, mode = 0%, max = +25%

This states that good communication could reduce the time it takes to do an item of work by up to 10%, and bad communication could increase it by up to 25%.

We make a Work Type Category consisting of just this one Productivity Risk Factor and apply it to both tasks.

Behind the scenes Tamara will simulate the task durations correlated via the Productivity Risk Factor, which can be shown as a scatter plot. The left plot shows how the Productivity Risk Factor has induced a positive correlation so that larger (smaller) observations tend to occur together; whilst the right plot shows what the pattern would have looked like with the same duration distributions (the combination of scope and productivity uncertainty) but no correlation:

The induced level of correlation in this example is 32%, and depends on the balance between the individual scope uncertainty of each task and the shared productivity uncertainty. If the scope uncertainty were very small, the productivity uncertainty would dominate and the correlation would be close to 100%. Conversely, if the scope uncertainty were huge and the productivity uncertainty minimal, the correlation would be close to 0%.
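
The example can be approximated with a short Monte Carlo sketch in Python. It assumes that the three-point estimates are triangular distributions and that the productivity effect multiplies the scope duration, i.e. duration = scope x (1 + effect); neither assumption is confirmed by the text, so the result should land in the same region as the 32% quoted above without necessarily matching it. The function name induced_correlation is illustrative.

```python
import numpy as np

def induced_correlation(scope_1, scope_2, effect, n=200_000, seed=42):
    """Correlation between two task durations that share one productivity factor.

    scope_1, scope_2 and effect are (min, mode, max) tuples.
    Assumption: triangular distributions, multiplicative productivity effect.
    """
    rng = np.random.default_rng(seed)
    s1 = rng.triangular(*scope_1, size=n)
    s2 = rng.triangular(*scope_2, size=n)
    factor = 1 + rng.triangular(*effect, size=n)  # one shared draw per iteration
    return np.corrcoef(s1 * factor, s2 * factor)[0, 1]

communication = (-0.10, 0.0, 0.25)

# The two design tasks from the example
example = induced_correlation((90, 100, 130), (80, 100, 140), communication)

# Very small scope uncertainty: the shared factor dominates
narrow_scope = induced_correlation((99, 100, 101), (99, 100, 101), communication)

# Minimal productivity uncertainty: the independent scope uncertainty dominates
weak_factor = induced_correlation((90, 100, 130), (80, 100, 140), (-0.001, 0.0, 0.001))

print(example, narrow_scope, weak_factor)
```

The three results follow the pattern described above: a moderate correlation for the example, a value close to 100% when scope uncertainty is tiny, and a value close to 0% when the productivity uncertainty is tiny.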

Let’s now look at the effect of introducing this correlation via the Productivity Risk Factor. The following plots show the cumulative probability distribution of the project duration: on the left is the result if the tasks were completed in parallel, and on the right if they were done one after the other (serial):

The results are not immediately intuitive. When the tasks are done in parallel, the correlated tasks (orange line) take on average less time to complete than when they are uncorrelated (blue line). This is because the project is complete only when the last task is finished. If the tasks are uncorrelated, there are two independent chances for a task to exceed, say, 110 days. If they are correlated and one task takes more than 110 days, the other task is also likely to do so, and vice versa, so there are fewer scenarios in which just one task overruns.

When the tasks are done one after the other (serial), the average duration of the project is the same whether the tasks are correlated or not. However, if they are correlated, then when one task takes more (less) time the other task is also likely to take more (less) time, so the total duration will be longer (shorter). If they are uncorrelated, there is more chance that a long duration of one task will be compensated by a short duration of the other. Thus, correlated tasks performed serially will produce a wider distribution of finish time, and therefore a higher risk of failing to meet a delivery date, than uncorrelated tasks.
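
Both effects can be checked with a simple simulation. The Python sketch below reuses the example's three-point estimates and assumes triangular distributions with a multiplicative productivity effect (an assumption, not Tamara's documented method). The uncorrelated case gives each task its own independent productivity draw, so the individual duration distributions are the same in both cases and only the correlation differs.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)
n = 200_000

def tri(low, mode, high):
    """Draw n samples from a triangular distribution."""
    return rng.triangular(low, mode, high, size=n)

scope_1 = tri(90, 100, 130)
scope_2 = tri(80, 100, 140)

# Correlated case: one shared productivity draw per iteration
shared = 1 + tri(-0.10, 0.0, 0.25)
corr_1, corr_2 = scope_1 * shared, scope_2 * shared

# Uncorrelated case: a separate productivity draw for each task
ind_1 = scope_1 * (1 + tri(-0.10, 0.0, 0.25))
ind_2 = scope_2 * (1 + tri(-0.10, 0.0, 0.25))

# Parallel: the project finishes when the longer task finishes
parallel_corr = np.maximum(corr_1, corr_2)
parallel_ind = np.maximum(ind_1, ind_2)

# Serial: the project duration is the sum of the two task durations
serial_corr = corr_1 + corr_2
serial_ind = ind_1 + ind_2

print("parallel mean:", parallel_corr.mean(), parallel_ind.mean())
print("serial mean:  ", serial_corr.mean(), serial_ind.mean())
print("serial P90:   ", np.percentile(serial_corr, 90), np.percentile(serial_ind, 90))
```

In this sketch the parallel project is shorter on average when the tasks are correlated, while the serial project has the same mean in both cases but a higher 90th percentile when the tasks are correlated.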