$$\hat{\sigma} = \sqrt{\frac{\sum \hat{u}_i^2}{n-2}}$$

is known as the standard error of estimate or the standard error of the regression (se). It is simply the standard deviation of the Y values about the estimated regression line and is often used as a summary measure of the "goodness of fit" of the estimated regression line, a topic discussed in Section 3.5.

Earlier we noted that, given $X_i$, $\sigma^2$ represents the (conditional) variance of both $u_i$ and $Y_i$. Therefore, the standard error of the estimate can also be called the (conditional) standard deviation of $u_i$ and $Y_i$. Of course, as usual, $\sigma_Y^2$ and $\sigma_Y$ represent, respectively, the unconditional variance and unconditional standard deviation of Y.
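As a rough illustration, the standard error of the regression can be computed directly from the OLS residuals. The sketch below assumes the usual two-variable formula (residual sum of squares divided by n - 2, then the square root); the data values and the function names `ols_two_variable` and `standard_error_of_regression` are made up for the example.

```python
import numpy as np

def ols_two_variable(X, Y):
    """OLS for Y = b1 + b2*X + u; returns (b1, b2, residuals)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = X - X.mean()                      # deviations of X from its mean
    y = Y - Y.mean()                      # deviations of Y from its mean
    b2 = np.sum(x * y) / np.sum(x ** 2)   # slope
    b1 = Y.mean() - b2 * X.mean()         # intercept
    resid = Y - (b1 + b2 * X)
    return b1, b2, resid

def standard_error_of_regression(X, Y):
    """Square root of RSS / (n - 2): spread of Y about the fitted line."""
    _, _, resid = ols_two_variable(X, Y)
    n = resid.size
    return np.sqrt(np.sum(resid ** 2) / (n - 2))

# Hypothetical data, chosen only for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

sigma_hat = standard_error_of_regression(X, Y)   # conditional spread of Y given X
sd_Y = Y.std(ddof=1)                             # unconditional sample std. dev. of Y
print(sigma_hat, sd_Y)
```

For data that lie close to a line, as here, the conditional measure `sigma_hat` comes out much smaller than the unconditional standard deviation `sd_Y`, which is the distinction the paragraph above draws.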

Note the following features of the variances (and therefore the standard errors) of $\hat{\beta}_1$ and $\hat{\beta}_2$.

1. The variance of $\hat{\beta}_2$ is directly proportional to $\sigma^2$ but inversely proportional to $\sum x_i^2$. That is, given $\sigma^2$, the larger the variation in the X values, the smaller the variance of $\hat{\beta}_2$ and hence the greater the precision with which $\beta_2$ can be estimated. In short, given $\sigma^2$, if there is substantial variation in the X values (recall Assumption 8), $\beta_2$ can be measured more accurately than when the $X_i$ do not vary substantially. Also, given $\sum x_i^2$, the larger the variance $\sigma^2$, the larger the variance of $\hat{\beta}_2$. Note that as the sample size n increases, the number of terms in the sum, $\sum x_i^2$, will increase. As n increases, the precision with which $\beta_2$ can be estimated also increases. (Why?)

2. The variance of $\hat{\beta}_1$ is directly proportional to $\sigma^2$ and $\sum X_i^2$ but inversely proportional to $\sum x_i^2$ and the sample size n.

3. Since $\hat{\beta}_1$ and $\hat{\beta}_2$ are estimators, they will not only vary from sample to sample but in a given sample they are likely to be dependent on each other, this dependence being measured by the covariance between them. It is shown in Appendix 3A, Section 3A.4 that

$$\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) = -\bar{X}\,\operatorname{var}(\hat{\beta}_2)$$


Since $\operatorname{var}(\hat{\beta}_2)$ is always positive, as is the variance of any variable, the nature of the covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$ depends on the sign of $\bar{X}$. If $\bar{X}$ is positive, then as the formula shows, the covariance will be negative. Thus, if the slope coefficient $\beta_2$ is overestimated (i.e., the slope is too steep), the intercept coefficient $\beta_1$ will be underestimated (i.e., the intercept will be too small). Later on (especially in the chapter on multicollinearity, Chapter 10), we will see the utility of studying the covariances between the estimated regression coefficients.
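The proportionality statements and the covariance result can be checked numerically. The sketch below assumes the standard two-variable formulas var(slope) = σ²/Σx², var(intercept) = σ²ΣX²/(nΣx²), and cov = -X̄·var(slope), and compares them with the matrix expression σ²(Z'Z)⁻¹, where Z is the regressor matrix with a column of ones. The X values, the value of σ², and the function name `coefficient_variances` are assumptions made for the example.

```python
import numpy as np

def coefficient_variances(X, sigma2):
    """Variances and covariance of the OLS intercept and slope,
    given the regressor values X and the error variance sigma2."""
    X = np.asarray(X, dtype=float)
    n = X.size
    x = X - X.mean()
    sum_x2 = np.sum(x ** 2)
    var_b2 = sigma2 / sum_x2                          # slope variance
    var_b1 = sigma2 * np.sum(X ** 2) / (n * sum_x2)   # intercept variance
    cov_b1_b2 = -X.mean() * var_b2                    # covariance of the two
    return var_b1, var_b2, cov_b1_b2

# Hypothetical regressor values and error variance.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0])
sigma2 = 4.0

var_b1, var_b2, cov12 = coefficient_variances(X, sigma2)

# Cross-check against the matrix form sigma^2 * (Z'Z)^{-1}.
Z = np.column_stack([np.ones_like(X), X])
V = sigma2 * np.linalg.inv(Z.T @ Z)
print(np.allclose(V, [[var_b1, cov12], [cov12, var_b2]]))

# Doubling the spread of X cuts the slope variance to one quarter.
print(coefficient_variances(2 * X, sigma2)[1] / var_b2)
```

Because the mean of these X values is positive, the computed covariance is negative, in line with the argument above.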

How do the variances and standard errors of the estimated regression coefficients enable one to judge the reliability of these estimates? This is a problem in statistical inference, and it will be pursued in Chapters 4 and 5.

3.4 PROPERTIES OF LEAST-SQUARES ESTIMATORS: THE GAUSS-MARKOV THEOREM19

As noted earlier, given the assumptions of the classical linear regression model, the least-squares estimates possess some ideal or optimum properties. These properties are contained in the well-known Gauss-Markov theorem. To understand this theorem, we need to consider the best linear unbiasedness property of an estimator.20 As explained in Appendix A, an estimator, say the OLS estimator $\hat{\beta}_2$, is said to be a best linear unbiased estimator (BLUE) of $\beta_2$ if the following hold:

1. It is linear, that is, a linear function of a random variable, such as the dependent variable Y in the regression model.

2. It is unbiased, that is, its average or expected value, $E(\hat{\beta}_2)$, is equal to the true value, $\beta_2$.

3. It has minimum variance in the class of all such linear unbiased estimators; an unbiased estimator with the least variance is known as an efficient estimator.
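The linearity property can be made concrete. The sketch below assumes the usual representation of the OLS slope as a weighted sum of the Y values, with weights k_i = x_i / Σx_i² that depend only on X; the simulated data, the true parameter values, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: true intercept 3.0, true slope 0.5, unit-variance errors.
X = np.linspace(1.0, 10.0, 20)
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.0, X.size)

x = X - X.mean()
k = x / np.sum(x ** 2)          # weights built from X alone

# The slope as a linear function of Y ...
b2_weighted = np.sum(k * Y)
# ... agrees with the familiar deviation-form formula.
b2_formula = np.sum(x * (Y - Y.mean())) / np.sum(x ** 2)
print(b2_weighted, b2_formula)

# Properties of the weights that underlie unbiasedness and the variance:
print(np.sum(k))                # zero, up to rounding
print(np.sum(k * X))            # one, up to rounding
print(np.sum(k ** 2) * np.sum(x ** 2))   # one: sum of k_i^2 equals 1 / sum of x_i^2
```

The conditions Σk_i = 0 and Σk_iX_i = 1 are what make the weighted sum unbiased for the slope when the errors have zero mean.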

In the regression context it can be proved that the OLS estimators are BLUE. This is the gist of the famous Gauss-Markov theorem, which can be stated as follows:

Gauss-Markov Theorem: Given the assumptions of the classical linear regression model, the least-squares estimators, in the class of unbiased linear estimators, have minimum variance, that is, they are BLUE.

The proof of this theorem is sketched in Appendix 3A, Section 3A.6. The full import of the Gauss-Markov theorem will become clearer as we move along. It is sufficient to note here that the theorem has theoretical as well as practical importance.21

19Although known as the Gauss-Markov theorem, the least-squares approach of Gauss antedates (1821) the minimum-variance approach of Markov (1900).

20The reader should refer to App. A for the importance of linear estimators as well as for a general discussion of the desirable properties of statistical estimators.

FIGURE 3.8 (a) Sampling distribution of $\hat{\beta}_2$; (b) sampling distribution of $\beta_2^*$.

What all this means can be explained with the aid of Figure 3.8.

In Figure 3.8(a) we have shown the sampling distribution of the OLS estimator $\hat{\beta}_2$, that is, the distribution of the values taken by $\hat{\beta}_2$ in repeated sampling experiments (recall Table 3.1). For convenience we have assumed $\hat{\beta}_2$ to be distributed symmetrically (but more on this in Chapter 4). As the figure shows, the mean of the $\hat{\beta}_2$ values, $E(\hat{\beta}_2)$, is equal to the true $\beta_2$. In this situation we say that $\hat{\beta}_2$ is an unbiased estimator of $\beta_2$. In Figure 3.8(b) we have shown the sampling distribution of $\beta_2^*$, an alternative estimator of $\beta_2$ obtained by using another (i.e., other than OLS) method. For convenience, assume that $\beta_2^*$, like $\hat{\beta}_2$, is unbiased, that is, its average or expected value is equal to $\beta_2$. Assume further that both $\hat{\beta}_2$ and $\beta_2^*$ are linear estimators, that is, they are linear functions of Y. Which estimator, $\hat{\beta}_2$ or $\beta_2^*$, would you choose?

21For example, it can be proved that any linear combination of the $\beta$'s, such as $(\beta_1 - 2\beta_2)$, can be estimated by $(\hat{\beta}_1 - 2\hat{\beta}_2)$, and this estimator is BLUE. For details, see Henri Theil, Introduction to Econometrics, Prentice-Hall, Englewood Cliffs, N.J., 1978, pp. 401-402. Note a technical point about the Gauss-Markov theorem: It provides only the sufficient (but not necessary) condition for OLS to be efficient. I am indebted to Michael McAleer of the University of Western Australia for bringing this point to my attention.

To answer this question, superimpose the two figures, as in Figure 3.8(c). It is obvious that although both $\hat{\beta}_2$ and $\beta_2^*$ are unbiased, the distribution of $\beta_2^*$ is more diffused or widespread around the mean value than the distribution of $\hat{\beta}_2$. In other words, the variance of $\beta_2^*$ is larger than the variance of $\hat{\beta}_2$. Now given two estimators that are both linear and unbiased, one would choose the estimator with the smaller variance because it is more likely to be close to $\beta_2$ than the alternative estimator. In short, one would choose the BLUE estimator.
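A small repeated-sampling experiment can mimic the comparison in the figure. The sketch below is only an illustration under stated assumptions: the alternative estimator is taken to be the slope of the line through the first and last observations, which is linear in Y and unbiased but is not an estimator named in the text; the true parameter values, the X grid, and the uniform error distribution are likewise hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true model and fixed regressor values.
beta1, beta2, sigma = 3.0, 0.5, 1.0
X = np.linspace(1.0, 10.0, 20)
x = X - X.mean()
reps = 5000

# Errors drawn from a uniform distribution scaled to have variance sigma^2.
half_width = np.sqrt(3.0) * sigma
U = rng.uniform(-half_width, half_width, size=(reps, X.size))
Y = beta1 + beta2 * X + U               # each row is one sample of size 20

# OLS slope in every sample: sum(x_i * Y_i) / sum(x_i^2).
ols = (Y @ x) / np.sum(x ** 2)

# Alternative linear unbiased estimator: slope through the two end points.
alt = (Y[:, -1] - Y[:, 0]) / (X[-1] - X[0])

print("mean of OLS slopes        :", ols.mean())
print("mean of alternative slopes:", alt.mean())
print("variance of OLS slopes        :", ols.var())
print("variance of alternative slopes:", alt.var())
```

Both sets of estimates centre on the true slope, but the end-point estimator is visibly more spread out, which is the situation pictured when the two sampling distributions are superimposed.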

The Gauss-Markov theorem is remarkable in that it makes no assumptions about the probability distribution of the random variable $u_i$, and therefore of $Y_i$ (in the next chapter we will take this up). As long as the assumptions of CLRM are satisfied, the theorem holds. As a result, we need not look for another linear unbiased estimator, for we will not find such an estimator whose variance is smaller than that of the OLS estimator. Of course, if one or more of these assumptions do not hold, the theorem is invalid. For example, if we consider nonlinear-in-the-parameter regression models (which are discussed in Chapter 14), we may be able to obtain estimators that perform better than the OLS estimators. Also, as we will show in the chapter on heteroscedasticity, if the assumption of homoscedastic variance is not fulfilled, the OLS estimators, although unbiased and consistent, are no longer minimum variance estimators even in the class of linear estimators.
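The heteroscedasticity remark can be illustrated without simulation, since the variance of any linear estimator Σw_iY_i with uncorrelated errors is Σw_i²σ_i². The sketch below compares the OLS slope with a weighted least-squares slope that uses known error variances; weighted least squares is brought in here purely as an assumed comparison estimator, and the variance pattern, the X values, and the function name `slope_variances` are hypothetical.

```python
import numpy as np

def slope_variances(X, sigma2_i):
    """Exact slope variances of OLS and of weighted least squares (WLS)
    when observation i has known error variance sigma2_i[i]."""
    X = np.asarray(X, dtype=float)
    sigma2_i = np.asarray(sigma2_i, dtype=float)

    # OLS slope = sum(k_i * Y_i), so its variance is sum(k_i^2 * sigma_i^2).
    x = X - X.mean()
    k = x / np.sum(x ** 2)
    var_ols = np.sum(k ** 2 * sigma2_i)

    # WLS with weights 1/sigma_i^2 has covariance matrix (Z' W Z)^{-1}.
    Z = np.column_stack([np.ones_like(X), X])
    W = np.diag(1.0 / sigma2_i)
    var_wls = np.linalg.inv(Z.T @ W @ Z)[1, 1]
    return var_ols, var_wls

X = np.linspace(1.0, 10.0, 20)

# Heteroscedastic case (assumed pattern): error variance grows with X^2.
print(slope_variances(X, 0.5 * X ** 2))

# Homoscedastic case: the two variances coincide.
print(slope_variances(X, np.full_like(X, 2.0)))
```

With unequal error variances the weighted estimator has the smaller slope variance, so OLS is no longer best among linear unbiased estimators; with equal variances the two coincide.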

The statistical properties that we have just discussed are known as finite sample properties: These properties hold regardless of the sample size on which the estimators are based. Later we will have occasions to consider the asymptotic properties, that is, properties that hold only if the sample size is very large (technically, infinite). A general discussion of finite-sample and large-sample properties of estimators is given in Appendix A.

3.5 THE COEFFICIENT OF DETERMINATION $r^2$: A MEASURE OF "GOODNESS OF FIT"

Thus far we were concerned with the problem of estimating regression coefficients, their standard errors, and some of their properties. We now consider the goodness of fit of the fitted regression line to a set of data; that is, we shall find out how "well" the sample regression line fits the data. From Figure 3.1 it is clear that if all the observations were to lie on the regression line, we would obtain a "perfect" fit, but this is rarely the case. Generally, there will be some positive $\hat{u}_i$ and some negative $\hat{u}_i$. What we hope for is that these residuals around the regression line are as small as possible. The coefficient of determination $r^2$ (two-variable case) or $R^2$ (multiple regression) is a summary measure that tells how well the sample regression line fits the data.
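As a numerical preview of the idea, the sketch below computes the coefficient of determination for a two-variable regression, assuming the usual definition as one minus the ratio of the residual sum of squares to the total variation in Y. The function name `r_squared` and the data are made up for the example.

```python
import numpy as np

def r_squared(X, Y):
    """Share of the variation in Y explained by an OLS regression on X,
    computed as 1 - RSS/TSS (assumed standard definition)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = X - X.mean()
    y = Y - Y.mean()
    b2 = np.sum(x * y) / np.sum(x ** 2)   # OLS slope
    resid = y - b2 * x                    # residuals in deviation form
    tss = np.sum(y ** 2)                  # total variation in Y
    rss = np.sum(resid ** 2)              # unexplained variation
    return 1.0 - rss / tss

# Hypothetical data sets.
X = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(X, 2.0 + 3.0 * X))                   # all points on a line
print(r_squared(X, np.array([1.0, -1.0, -1.0, 1.0])))  # no linear relation
print(r_squared(X, np.array([1.2, 1.9, 3.4, 3.8])))    # somewhere in between
```

The first case corresponds to a perfect fit and the second to no linear association at all, the two extremes between which the measure lies.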


FIGURE 3.9 The Ballentine view of $r^2$: (a) $r^2 = 0$; (f) $r^2 = 1$.

Before we show how $r^2$ is computed, let us consider a heuristic explanation of $r^2$ in terms of a graphical device, known as the Venn diagram, or the Ballentine, as shown in Figure 3.9.22

In this figure the circle Y represents variation in the dependent variable Y and the circle X represents variation in the explanatory variable X.23 The overlap of the two circles (the shaded area) indicates the extent to which the variation in Y is explained by the variation in X (say, via an OLS regression). The greater the extent of the overlap, the greater the share of the variation in Y that is explained by X. The $r^2$ is simply a numerical measure of this overlap. In the figure, as we move from left to right, the area of the overlap increases, that is, successively a greater proportion of the variation in Y is explained by X. In short, $r^2$ increases. When there is no overlap, $r^2$ is obviously zero, but when the overlap is complete, $r^2$ is 1, since 100 percent of the variation in Y is explained by X. As we shall show shortly, $r^2$ lies between 0 and 1. To compute this $r^2$, we proceed as follows: Recall that, in deviation form,

$$y_i = \hat{y}_i + \hat{u}_i \tag{3.5.1}$$

where use is made of (3.1.13) and (3.1.14). Squaring (3.5.1) on both sides

22See Peter Kennedy, "Ballentine: A Graphical Aid for Econometrics," Australian Economics Papers, vol. 20, 1981, pp. 414-416. The name Ballentine is derived from the emblem of the well-known Ballantine beer with its circles.

23The terms variation and variance are different. Variation means the sum of squares of the deviations of a variable from its mean value. Variance is this sum of squares divided by the appropriate degrees of freedom. In short, variance = variation/df.

and summing over the sample, we obtain 