Estimating the parameters of a least squares regression

In least squares regression, one attempts to model the change in one variable y (the response or dependent variable) as a function of one or more other variables {x} (the explanatory or independent variables). The regression relationship between {x} and y minimises the sum of squared errors between a fitted equation for y and the observations. The theory of least squares regression assumes that the random variations about the fitted equation (due to effects not explained by the explanatory variables) are Normally distributed with constant variance across all x values, which means the fitted equation describes the mean y value for a given set of {x}.

For simplicity we will consider a single explanatory variable x (i.e. simple regression analysis) and assume that the relationship between x and y is linear (i.e. linear regression analysis). We will therefore model the variability in y due to changes in x with the following equation:

                y = Normal(mx+c, s)

where m and c are the gradient and y-intercept of the straight line relationship between x and y, and s is the standard deviation of the additional variation observed in y that is not explained by the linear equation in x. The figure below illustrates these concepts.

In least squares linear regression, we typically have a set of n paired observations {xi, yi} for which we wish to fit this linear relationship.
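As a minimal sketch of the fitting step, the Python below computes closed-form least squares estimates of m, c and s from paired observations. The function name fit_line is hypothetical, and the (n-2) divisor used for s is the usual textbook convention, assumed here rather than taken from this help file.

```python
import math

def fit_line(x, y):
    """Return (m, c, s) for the least squares line y = m*x + c.

    s is the residual standard deviation, computed with an (n - 2)
    divisor (an assumption: the common textbook convention).
    """
    n = len(x)
    if n < 3 or n != len(y):
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / s_xx          # gradient
    c = y_bar - m * x_bar    # y-intercept
    # Sum of squared differences between observations and fitted values
    sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    return m, c, s
```

For example, fit_line([0, 1, 2], [0, 1, 1]) gives m = 0.5 and c = 1/6. Because the fitted line always passes through the point (mean of x, mean of y), c follows directly from m.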

Classical statistics provides us with the best fitting values for m, c and s, assuming the model's assumptions to be correct, and exact distributions of uncertainty for the estimate ŷP = (mxP + c) at some value xP (see, for example, McClave et al. (1997)) and for s, as follows:

where:

t(n-2) is a Student-t distribution with (n-2) degrees of freedom;

χ²(n-1) is a Chi-squared distribution with (n-1) degrees of freedom; and

s is the standard deviation of the differences ei between the observed value yi and its predictor ŷi = mxi + c, i.e.:

The uncertainty distribution for s is independent of the uncertainty distribution for (mx+c) since the model assumes that the random variations about the regression line have constant variance, i.e. that they are independent of the values of x and y. It turns out that these same results are given by Bayesian inference with uninformed priors, i.e. p(m,c,s) ∝ 1/s.
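The uncertainty about the fitted mean at xP can also be explored by simulation. The sketch below assumes the standard textbook form ŷP = (mxP + c) + t(n-2) · s · sqrt(1/n + (xP - x̄)² / Σ(xi - x̄)²), with s computed using an (n-2) divisor; both the form and the function name sample_mean_response are assumptions made for illustration, not a transcription of this help file's equations. It covers only the uncertainty about ŷP, not the distribution for s.

```python
import math
import random

def sample_mean_response(x, y, x_p, n_samples=10000, seed=0):
    """Draw samples of the uncertain mean response at x_p.

    Assumes y_hat_p = (m*x_p + c) + t(n-2) * s * sqrt(1/n + (x_p - x_bar)^2 / S_xx).
    """
    rng = random.Random(seed)
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = s_xy / s_xx
    c = y_bar - m * x_bar
    sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # assumption: (n - 2) divisor
    scale = s * math.sqrt(1.0 / n + (x_p - x_bar) ** 2 / s_xx)
    dof = n - 2
    samples = []
    for _ in range(n_samples):
        # Student-t variate built from a standard Normal and a
        # Chi-squared(dof) variate, where Chi-squared(dof) = Gamma(dof/2, scale 2)
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(dof / 2.0, 2.0)
        t = z / math.sqrt(chi2 / dof)
        samples.append(m * x_p + c + t * scale)
    return samples
```

Percentiles of the returned list then give an uncertainty interval for the mean of y at xP. Only the Python standard library is used, so the Student-t variate is constructed by hand rather than drawn from a statistics package.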

The uncertainty equation for ŷi = mxi + c produces a relationship between x and y with uncertainty that is pinched at the data's centre of gravity and fans out the further one gets from the centre. This makes sense: the further we move towards the extremes of the set of observations, the more uncertain we should be about the relationship. Strictly speaking, the theory of regression analysis says that the relationship can only be considered to hold within the range of observed values for x. However, with caution, one can reasonably extrapolate a little past the range of observed values, though the further one extends beyond the observed range, the more tenuous the validity of the analysis becomes.
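The pinching can be seen numerically. Assuming the standard textbook form in which the spread of the uncertainty about the fitted mean at xP is proportional to sqrt(1/n + (xP - x̄)² / Σ(xi - x̄)²), the sketch below evaluates that factor at several points. The function name band_factor and the x values are hypothetical, chosen only for illustration.

```python
import math

def band_factor(x, x_p):
    """Relative width of the uncertainty about the fitted mean at x_p.

    Assumes the textbook factor sqrt(1/n + (x_p - x_bar)^2 / S_xx),
    which depends only on the observed x values.
    """
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(1.0 / n + (x_p - x_bar) ** 2 / s_xx)

x_obs = [1, 2, 3, 4, 5]   # illustrative observed x values, mean = 3
for x_p in (1, 3, 5, 8):  # 8 lies outside the observed range
    print(x_p, round(band_factor(x_obs, x_p), 3))
```

With these values the factor is 0.447 at the mean (x = 3), 0.775 at either end of the observed range (x = 1 and x = 5), and 1.643 at x = 8, which shows how quickly the uncertainty grows under extrapolation.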