## Model Selection Criteria

In this section we discuss several criteria that have been used to choose among competing models and/or to compare models for forecasting purposes. Here we distinguish between in-sample forecasting and out-of-sample forecasting. In-sample forecasting essentially tells us how the chosen model fits the data in a given sample. Out-of-sample forecasting is concerned with determining how a fitted model forecasts future values of the regressand, given the values of the regressors.

Several criteria are used for this purpose. In particular, we discuss these criteria: (1) $R^2$, (2) adjusted $R^2$ ($= \bar{R}^2$), (3) Akaike information criterion (AIC), (4) Schwarz information criterion (SIC), (5) Mallows's $C_p$ criterion, and (6) forecast $\chi^2$ (chi-square). All these criteria aim at minimizing the residual sum of squares (RSS) (or, equivalently, increasing the $R^2$ value). However, unlike the first criterion, criteria (2), (3), (4), and (5) impose a penalty for including an increasingly large number of regressors. Thus there is a tradeoff between the goodness of fit of the model and its complexity (as judged by the number of regressors).

The $R^2$ Criterion

One of the measures of goodness of fit of a regression model is $R^2$, which, as we know, is defined as:

$$R^2 = \frac{\text{ESS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \tag{13.9.1}$$

$R^2$, thus defined, necessarily lies between 0 and 1. The closer it is to 1, the better the fit. But there are problems with $R^2$. First, it measures in-sample goodness of fit in the sense of how close an estimated $Y$ value is to its actual value in the given sample. There is no guarantee that the model will forecast out-of-sample observations well. Second, in comparing two or more $R^2$ values, the dependent variable, or regressand, must be the same. Third, and more importantly, $R^2$ cannot fall when more variables are added to the model. Therefore, there is every temptation to play the game of "maximizing the $R^2$" by simply adding more variables to the model. Of course, adding more variables to the model may increase $R^2$, but it may also increase the variance of the forecast error.
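
The third point can be checked numerically. The sketch below is only an illustration on simulated data (the data-generating process, the seed, and the helper name `r_squared` are assumptions, not part of the text): it adds pure-noise regressors one at a time and records $R^2$ after each addition.

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - RSS/TSS for an OLS fit of y on X (X must contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    return 1.0 - rss / tss

rng = np.random.default_rng(0)            # fixed seed; illustrative data only
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)    # hypothetical data-generating process

X = np.column_stack([np.ones(n), x])
r2_path = [r_squared(X, y)]
for _ in range(5):
    # Append a regressor that is pure noise and unrelated to y.
    X = np.column_stack([X, rng.normal(size=n)])
    r2_path.append(r_squared(X, y))

print([round(v, 4) for v in r2_path])     # the sequence never decreases
```

Even though none of the added regressors belongs in the model, $R^2$ does not fall, which is why it cannot serve on its own as a model selection criterion.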

CHAPTER THIRTEEN: ECONOMETRIC MODELING 537

Adjusted $R^2$

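As a penalty for adding regressors to increase the $R^2$ value, Henry Theil developed the adjusted $R^2$, denoted by $\bar{R}^2$, which we studied in Chapter 7. Recall that

$$\bar{R}^2 = 1 - \frac{\text{RSS}/(n-k)}{\text{TSS}/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-k} \tag{13.9.2}$$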

As you can see from this formula, $\bar{R}^2 \le R^2$, showing how the adjusted $R^2$ penalizes for adding more regressors. As we noted in Chapter 8, unlike $R^2$, the adjusted $R^2$ will increase only if the absolute $t$ value of the added variable is greater than 1. For comparative purposes, therefore, $\bar{R}^2$ is a better measure than $R^2$. But again keep in mind that the regressand must be the same for the comparison to be valid.
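
The rule relating $\bar{R}^2$ to the $t$ value of the added variable can be verified on simulated data. The following is a minimal sketch under assumed settings (the helper `ols`, the coefficients, and the seed are all hypothetical): for many random samples it adds one regressor and checks whether $\bar{R}^2$ rises exactly when the added variable's $|t|$ exceeds 1.

```python
import numpy as np

def ols(X, y):
    """Return (adjusted R^2, t ratios) for an OLS fit; X must contain a constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    sigma2_hat = rss / (n - k)                              # unbiased error-variance estimate
    se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return adj_r2, beta / se

rng = np.random.default_rng(1)   # fixed seed; illustrative data only
n = 40
checks = []
for _ in range(200):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 0.5 * x1 + 0.15 * x2 + rng.normal(size=n)    # weak effect of x2, by assumption
    X_small = np.column_stack([np.ones(n), x1])
    X_big = np.column_stack([X_small, x2])
    adj_small, _ = ols(X_small, y)
    adj_big, t_big = ols(X_big, y)
    # Adjusted R^2 should rise if and only if |t| of the added variable exceeds 1.
    checks.append((adj_big > adj_small) == (abs(t_big[-1]) > 1.0))

print(all(checks))
```

The weak coefficient on `x2` is chosen so that both outcomes (a rise and a fall in $\bar{R}^2$) occur across the simulated samples.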

Akaike Information Criterion (AIC)

The idea of imposing a penalty for adding regressors to the model has been carried further in the AIC criterion, which is defined as:

$$\text{AIC} = e^{2k/n}\,\frac{\sum \hat{u}_i^2}{n} = e^{2k/n}\,\frac{\text{RSS}}{n} \tag{13.9.3}$$

where $k$ is the number of regressors (including the intercept) and $n$ is the number of observations. For mathematical convenience, (13.9.3) is written as

$$\ln \text{AIC} = \frac{2k}{n} + \ln\left(\frac{\text{RSS}}{n}\right) \tag{13.9.4}$$

where $\ln \text{AIC}$ = natural log of AIC and $2k/n$ = penalty factor. Some textbooks and software packages define AIC only in terms of its log transform, so there is no need to put ln before AIC. As you see from this formula, AIC imposes a harsher penalty than $\bar{R}^2$ for adding more regressors. In comparing two or more models, the model with the lowest value of AIC is preferred. One advantage of AIC is that it is useful for assessing not only the in-sample but also the out-of-sample forecasting performance of a regression model. Also, it is useful for both nested and non-nested models. It has also been used to determine the lag length in an AR($p$) model.
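
As an illustration of how the criterion is applied, the sketch below computes AIC and ln AIC for a sequence of nested models on simulated data and picks the model with the lowest value. The function names (`rss_of_fit`, `aic`, `ln_aic`), the data-generating process, and the seed are assumptions made for the example only.

```python
import math
import numpy as np

def rss_of_fit(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def aic(rss, n, k):
    """AIC = exp(2k/n) * RSS/n, where k counts the intercept."""
    return math.exp(2 * k / n) * rss / n

def ln_aic(rss, n, k):
    """ln AIC = 2k/n + ln(RSS/n), the log form of the same criterion."""
    return 2 * k / n + math.log(rss / n)

rng = np.random.default_rng(2)                 # fixed seed; illustrative data only
n = 80
Z = rng.normal(size=(n, 4))                    # four candidate regressors
y = 2.0 + 1.0 * Z[:, 0] + rng.normal(size=n)   # by assumption only the first one matters

scores = {}
for m in range(1, 5):                          # m = number of slope regressors included
    X = np.column_stack([np.ones(n), Z[:, :m]])
    k = X.shape[1]                             # k includes the intercept
    rss = rss_of_fit(X, y)
    scores[k] = (aic(rss, n, k), ln_aic(rss, n, k))

best_k = min(scores, key=lambda kk: scores[kk][0])
for kk, (a, la) in scores.items():
    print(kk, round(a, 4), round(la, 4))
print("preferred k:", best_k)
```

Because the logarithm is monotonic, ranking models by AIC or by ln AIC gives the same preferred model.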

Schwarz Information Criterion (SIC)

Similar in spirit to the AIC, the SIC criterion is defined as:

$$\text{SIC} = n^{k/n}\,\frac{\sum \hat{u}_i^2}{n} = n^{k/n}\,\frac{\text{RSS}}{n} \tag{13.9.5}$$

538 PART TWO: RELAXING THE ASSUMPTIONS OF THE CLASSICAL MODEL

or in log-form:

$$\ln \text{SIC} = \frac{k}{n}\ln n + \ln\left(\frac{\text{RSS}}{n}\right) \tag{13.9.6}$$

where $(k/n)\ln n$ is the penalty factor. SIC imposes a harsher penalty than AIC, as is obvious from comparing (13.9.6) to (13.9.4). Like AIC, the lower the value of SIC, the better the model. Again, like AIC, SIC can be used to compare the in-sample or out-of-sample forecasting performance of a model.
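
To make the comparison of the two penalties concrete, the short calculation below sets the penalty factors side by side. The values $n = 30$ and $k = 4$ are illustrative assumptions, not figures from the text.

$$
\frac{k}{n}\ln n - \frac{2k}{n} = \frac{k}{n}\,(\ln n - 2) > 0 \iff n > e^2 \approx 7.39
$$

$$
n = 30,\; k = 4: \qquad \frac{2k}{n} = \frac{8}{30} \approx 0.267, \qquad \frac{k}{n}\ln n = \frac{4}{30}\,(3.401) \approx 0.453
$$

So the SIC penalty exceeds the AIC penalty whenever the sample has at least 8 observations, and the gap widens as $n$ grows.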

Mallows's $C_p$ Criterion

Suppose we have a model consisting of $k$ regressors, including the intercept. Let $\hat{\sigma}^2$, as usual, be the estimator of the true $\sigma^2$. But suppose that we only choose $p$ regressors ($p < k$) and obtain the RSS from the regression using these $p$ regressors. Let $\text{RSS}_p$ denote the residual sum of squares using the $p$ regressors. Now C. P. Mallows has developed the following criterion for model selection, known as the $C_p$ criterion:

$$C_p = \frac{\text{RSS}_p}{\hat{\sigma}^2} - (n - 2p) \tag{13.9.7}$$

where $n$ is the number of observations.

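We know that $\hat{\sigma}^2$ is an unbiased estimator of the true $\sigma^2$, that is, $E(\hat{\sigma}^2) = \sigma^2$. Now, if the model with $p$ regressors is adequate in that it does not suffer from lack of fit, it can be shown³⁸ that $E(\text{RSS}_p) = (n - p)\sigma^2$. In consequence, it is true approximately that

$$E(C_p) \approx \frac{(n - p)\sigma^2}{\sigma^2} - (n - 2p) \approx p \tag{13.9.8}$$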

In choosing a model according to the $C_p$ criterion, we would look for a model that has a low $C_p$ value, about equal to $p$. In other words, following the principle of parsimony, we will choose a model with $p$ regressors ($p < k$) that gives a fairly good fit to the data.

In practice, one usually plots $C_p$ computed from (13.9.7) against $p$. An "adequate" model will show up as a point close to the $C_p = p$ line, as can be seen from Figure 13.3. As this figure shows, Model A may be preferable to Model B, as it is closer to the $C_p = p$ line than Model B.
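
Since $C_p$ is not routinely reported by software, it may help to see how it can be computed from its definition. The sketch below is a hedged illustration on simulated data (the candidate regressors, coefficients, seed, and the helper names `rss_of_fit` and `mallows_cp` are assumptions): it estimates $\hat{\sigma}^2$ from the full model with $k$ regressors and then evaluates $C_p$ for every subset that keeps the intercept.

```python
import numpy as np
from itertools import combinations

def rss_of_fit(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(rss_p, sigma2_full, n, p):
    """C_p = RSS_p / sigma^2_hat - (n - 2p), with sigma^2_hat from the full model."""
    return rss_p / sigma2_full - (n - 2 * p)

rng = np.random.default_rng(3)                 # fixed seed; illustrative data only
n = 60
Z = rng.normal(size=(n, 3))                    # three candidate regressors
# By assumption the third candidate is irrelevant.
y = 1.0 + 1.5 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(size=n)

const = np.ones((n, 1))
X_full = np.hstack([const, Z])
k = X_full.shape[1]                            # k = 4, including the intercept
sigma2_full = rss_of_fit(X_full, y) / (n - k)

results = {}
for size in range(0, 4):
    for cols in combinations(range(3), size):
        # Every candidate model keeps the intercept; p counts it as a regressor.
        X_p = np.hstack([const] + [Z[:, [j]] for j in cols])
        p = X_p.shape[1]
        results[cols] = (p, mallows_cp(rss_of_fit(X_p, y), sigma2_full, n, p))

for cols, (p, cp) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(cols, p, round(cp, 2))
```

For the full model the criterion equals $k$ by construction, since $\text{RSS}_k/\hat{\sigma}^2 = n - k$; subsets that omit a relevant regressor typically show $C_p$ well above $p$, while adequate subsets lie near the $C_p = p$ line.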

A Word of Caution about Model Selection Criteria

We have discussed several model selection criteria. But one should look at these criteria as an adjunct to the various specification tests we have discussed in this chapter. Some of the criteria discussed above are purely descriptive and may not have strong theoretical properties. Some of them may even be open to the charge of data mining. Nonetheless, they are so frequently used by the practitioner that the reader should be aware of them. No one of these criteria is necessarily superior to the others.³⁹ Most modern software packages now include $R^2$, adjusted $R^2$, AIC, and SIC. Mallows's $C_p$ is not routinely given, although it can be easily computed from its definition.

³⁸Norman D. Draper and Harry Smith, Applied Regression Analysis, 3d ed., John Wiley & Sons, New York, 1998, p. 332. See this book for some worked examples of $C_p$.

Forecast Chi-Square ($\chi^2$)

Suppose we have a regression model based on $n$ observations and suppose we want to use it to forecast the (mean) values of the regressand for an additional $t$ observations. As noted elsewhere, it is a good idea to save part of the sample data to see how the estimated model forecasts the observations not included in the sample, the postsample period. Now the forecast $\chi^2$ test is defined as follows:

$$\text{Forecast } \chi^2 = \frac{\sum_{i=n+1}^{n+t} \hat{u}_i^2}{\hat{\sigma}^2} \tag{13.9.9}$$

where $\hat{u}_i$ is the forecast error made for period $i$ ($= n+1, n+2, \ldots, n+t$), using the parameters obtained from the fitted regression and the values of the regressors in the postsample period, and $\hat{\sigma}^2$ is the usual OLS estimator of $\sigma^2$ based on the fitted regression.

³⁹For a useful discussion on this topic, see Francis X. Diebold, Elements of Forecasting, 2d ed., South Western Publishing, 2001, pp. 83-89. On balance, Diebold recommends the SIC criterion.


If we hypothesize that the parameter values have not changed between the sample and postsample periods, it can be shown that the statistic given in (13.9.9) follows the chi-square distribution with $t$ degrees of freedom, where $t$ is the number of periods for which the forecast is made. As Charemza and Deadman note, the forecast $\chi^2$ test has weak statistical power, meaning that the probability that the test will correctly reject a false null hypothesis is low. The test should therefore be used as a signal rather than as a definitive test.⁴⁰
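
The mechanics of the test can be sketched as follows. This is an illustration on simulated data, not a definitive implementation: the function name `forecast_chi2`, the sample sizes ($n = 40$, $t = 10$), the size of the hypothetical postsample intercept shift, and the seed are all assumptions. The 5 percent critical value for 10 degrees of freedom, 18.307, is taken from standard chi-square tables.

```python
import numpy as np

def forecast_chi2(X_fit, y_fit, X_post, y_post):
    """Sum of squared postsample forecast errors divided by the in-sample sigma^2_hat."""
    n, k = X_fit.shape
    beta, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    resid = y_fit - X_fit @ beta
    sigma2_hat = float(resid @ resid) / (n - k)   # usual OLS estimator of the error variance
    forecast_err = y_post - X_post @ beta         # errors for periods n+1, ..., n+t
    return float(forecast_err @ forecast_err) / sigma2_hat

rng = np.random.default_rng(4)                    # fixed seed; illustrative data only
n, t = 40, 10
x = rng.normal(size=n + t)
X = np.column_stack([np.ones(n + t), x])
y = 1.0 + 2.0 * x + rng.normal(size=n + t)        # parameters stable over all n + t periods

stat_stable = forecast_chi2(X[:n], y[:n], X[n:], y[n:])

y_shift = y.copy()
y_shift[n:] += 5.0                                # hypothetical intercept shift after period n
stat_shift = forecast_chi2(X[:n], y_shift[:n], X[n:], y_shift[n:])

crit_5pct = 18.307                                # chi-square 5% critical value, 10 df
print(round(stat_stable, 2), round(stat_shift, 2), crit_5pct)
```

A statistic above the critical value signals that the fitted parameters may not carry over to the postsample period; given the low power of the test, a value below it is weak evidence of stability.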