In least squares regression, one is attempting to model the change in one
variable y (the response or dependent variable) as a function of
one or more other variables {x} (the explanatory or independent
variables). The regression relationship between {x} and y
minimises the sum of squared errors between a fitted equation for y
and the observations. The theory of least squares regression assumes that the
random variations about the fitted equation (due to effects not explained by the
explanatory variables) are Normally distributed with constant variance
across all x values, which means the fitted equation describes the
mean y value for a given set of {x}. For simplicity we will
consider a single explanatory variable x (i.e. simple regression
analysis) and assume that the relationship between x and y is
linear (i.e. linear regression analysis). That is, we will model
the variability in y due to changes in x with the following
equation:
y = Normal(mx+c, s)
where m and c are the gradient and y-intercept of the straight line relationship between x and y, and s is the standard deviation of the additional variation observed in y that is not explained by the linear equation in x. The figure below illustrates these concepts.
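As a minimal sketch of what this model means in practice, the snippet below draws one y value for each x from Normal(mx+c, s). The function name `simulate_linear_data` and the parameter values are illustrative assumptions, not part of the original help file.

```python
import random


def simulate_linear_data(x_values, m, c, s, seed=0):
    """Draw one y per x from the model y = Normal(m*x + c, s).

    m and c are the gradient and y-intercept; s is the standard deviation
    of the variation in y not explained by the straight line.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(m * x + c, s) for x in x_values]


# Hypothetical parameter values, chosen only for illustration.
x_values = [float(i) for i in range(1, 11)]
y_values = simulate_linear_data(x_values, m=2.0, c=1.0, s=0.5)
```

Setting s to zero in this sketch puts every point exactly on the line, which is a quick way to check that m and c are being used as intended.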

In least squares linear regression, we typically have a set of n paired observations {xi, yi} to which we wish to fit this linear relationship. Classical statistics provides the best-fitting values for m, c and s, assuming the model's assumptions to be correct, together with exact distributions of uncertainty for the estimate ŷP = (mxP+c) at some value xP and for s (see, for example, McClave et al. (1997)). These results are expressed in terms of the following quantities:

t(n-2) is a Student-t distribution with (n-2) degrees of freedom;
χ²(n-1) is a Chi-squared distribution with (n-1) degrees of freedom; and
s is the standard deviation of the differences ei between the observed value yi and its predictor ŷi = mxi+c.
The uncertainty distribution for s is independent of the uncertainty distribution for (mx+c) because the model assumes that the random variations about the regression line have constant variance, i.e. that they are independent of the values of x and y. It turns out that these same results are given by Bayesian inference with uninformed priors, i.e. p(m,c,s) ∝ 1/s.
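The fitting step and the Student-t uncertainty about the estimate (mxP+c) can be sketched as follows. This uses the standard textbook formulas, which are assumed (not confirmed) to match the equations this help file refers to. In particular, the n-2 divisor for s and the standard-error expression are assumptions, and the function names and data are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Return least squares estimates (m, c, s) for y = Normal(m*x + c, s)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points so that n - 2 > 0")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = y_bar - m * x_bar
    # Assumption: the residual standard deviation uses the usual n - 2 divisor.
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    return m, c, s


def sample_mean_response(xs, ys, x_p, n_samples=1000, seed=0):
    """Monte Carlo draws from the uncertainty distribution of m*x_p + c."""
    rng = random.Random(seed)
    n = len(xs)
    m, c, s = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Assumed textbook standard error of the fitted mean at x_p.
    std_err = s * math.sqrt(1.0 / n + (x_p - x_bar) ** 2 / sxx)
    dof = n - 2
    samples = []
    for _ in range(n_samples):
        # Student-t(dof) draw built from a Normal and a Chi-squared draw.
        chi2 = rng.gammavariate(dof / 2.0, 2.0)
        t = rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)
        samples.append(m * x_p + c + t * std_err)
    return samples


# Illustrative data only; these values do not come from the help file.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c, s = fit_line(xs, ys)
draws = sample_mean_response(xs, ys, x_p=3.0, n_samples=20000, seed=1)
```

The Student-t draws are built from the standard library's Normal and Gamma generators so that the sketch needs no third-party packages. A statistics library's t distribution would serve equally well.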
The uncertainty equation for
ŷi = mxi+c produces a relationship between
x and y with uncertainty that is pinched at the data's centre
of gravity (the mean of the observed x and y values), and fans out the
further one gets from that centre. This makes sense: the further we move
towards the extremes of the set of observations, the more uncertain we
should be about the relationship. Strictly speaking,
the theory of regression analysis says that the relationship can only
be considered to hold within the range of observed values for x.
However, with caution, one can reasonably extrapolate a little past the
range of observed values, though the further one extends beyond the
observed range, the more tenuous the validity of the analysis becomes.
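The pinching and fanning-out described above can be seen numerically. The sketch below assumes the standard textbook expression for the standard error of the fitted mean, in which s is multiplied by sqrt(1/n + (xP - x̄)²/Σ(xi - x̄)²). The function name `width_factor` and the x values are hypothetical.

```python
import math


def width_factor(xs, x_p):
    """Multiplier on s giving the standard error of the fitted mean at x_p.

    Assumes the textbook form sqrt(1/n + (x_p - x_bar)^2 / Sxx).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return math.sqrt(1.0 / n + (x_p - x_bar) ** 2 / sxx)


# Illustrative x values; 3.0 is their mean, 8.0 lies outside the observed range.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
factors = {x_p: width_factor(xs, x_p) for x_p in (3.0, 5.0, 8.0)}
```

Under this assumption the factor is smallest at the mean of the observed x values and grows symmetrically on either side, which is why extrapolated estimates carry noticeably wider uncertainty.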