## Instrumental Variables Estimation

An alternative set of results for estimation in this model (and numerous others) is built around the method of instrumental variables. Consider once again the errors in variables model in (5-25) and (5-26a,b). The parameters, $\beta$, $\sigma_\varepsilon^2$, $q^*$, and $\sigma_u^2$ are not identified in terms of the moments of $x$ and $y$. Suppose, however, that there exists a variable $z$ such that $z$ is correlated with $x^*$ but not with $u$. For example, in surveys of families, income is notoriously badly reported, partly deliberately and partly because respondents often neglect some minor sources. Suppose, however, that one could determine the total amount of checks written by the head(s) of the household. It is quite likely that this $z$ would be highly correlated with income, but perhaps not significantly correlated with the errors of measurement. If $\operatorname{Cov}[x^*, z]$ is not zero, then the parameters of the model become estimable, as

$$\operatorname{plim} \frac{(1/n)\sum_i y_i z_i}{(1/n)\sum_i x_i z_i} = \frac{\beta \operatorname{Cov}[x^*, z]}{\operatorname{Cov}[x^*, z]} = \beta. \tag{5-32}$$
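The ratio of sample moments in (5-32) can be illustrated with a small simulation. This is only a sketch under assumed values: the sample size, the true slope `beta`, and all variances are made up for illustration, and the instrument `z` is built to be related to `x_star` but independent of both the measurement error `u` and the disturbance `eps`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000          # illustrative sample size (assumption)
beta = 2.0           # illustrative true slope (assumption)

x_star = rng.normal(0.0, 1.0, n)              # true, unobserved regressor
z = 0.8 * x_star + rng.normal(0.0, 0.6, n)    # instrument: related to x*, not to u or eps
u = rng.normal(0.0, 1.0, n)                   # measurement error
eps = rng.normal(0.0, 1.0, n)                 # disturbance

x = x_star + u               # observed, mismeasured regressor
y = beta * x_star + eps      # outcome depends on the true regressor


def sample_cov(a, b):
    """Sample covariance computed from deviations from means."""
    return np.mean((a - a.mean()) * (b - b.mean()))


# Least squares slope of y on x: attenuated toward zero.
# Here Var[x*] = Var[u] = 1, so its probability limit is beta / 2.
b_ols = sample_cov(x, y) / sample_cov(x, x)

# Instrumental variables slope: ratio of cross moments with z.
b_iv = sample_cov(z, y) / sample_cov(z, x)

print(f"least squares: {b_ols:.3f}   instrumental variables: {b_iv:.3f}")
```

With these assumed variances the least squares slope should settle near half the true value, while the instrumental variables ratio should settle near the true value.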

In a multiple regression framework, if only a single variable is measured with error, then the preceding can be applied to that variable and the remaining variables can serve as their own instruments. If more than one variable is measured with error, then the first preceding proposal will be cumbersome at best, whereas the second can be applied to each.

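For the general case, $\mathbf{y} = \mathbf{X}^*\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, $\mathbf{X} = \mathbf{X}^* + \mathbf{U}$, suppose that there exists a matrix of variables $\mathbf{Z}$ that is not correlated with the disturbances or the measurement error but is correlated with the regressors, $\mathbf{X}$. Then the instrumental variables estimator based on $\mathbf{Z}$, $\mathbf{b}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$, is consistent and asymptotically normally distributed with asymptotic covariance matrix that is estimated with

$$\text{Est.Asy.Var}[\mathbf{b}_{IV}] = \hat{\sigma}^2\,[\mathbf{Z}'\mathbf{X}]^{-1}[\mathbf{Z}'\mathbf{Z}][\mathbf{X}'\mathbf{Z}]^{-1}.$$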

For more general cases, Theorem 5.3 and the results in Section 5.4 apply.
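The matrix form of the estimator can be sketched as follows. The function name `iv_estimate` and the simulated data are hypothetical. The sketch assumes that $\mathbf{Z}$ has the same number of columns as $\mathbf{X}$, that the variance estimate is the mean squared residual from the instrumental variables fit, and that the constant and the correctly measured variable `w` serve as their own instruments.

```python
import numpy as np


def iv_estimate(y, X, Z):
    """Return b_IV = (Z'X)^{-1} Z'y and the estimated asymptotic covariance
    sigma2 * (Z'X)^{-1} (Z'Z) (X'Z)^{-1}.

    Assumes Z and X have the same number of columns (hypothetical helper).
    """
    ZX = Z.T @ X
    b = np.linalg.solve(ZX, Z.T @ y)
    resid = y - X @ b                      # residuals use X, not Z
    sigma2 = resid @ resid / len(y)
    ZX_inv = np.linalg.inv(ZX)
    # (X'Z)^{-1} is the transpose of (Z'X)^{-1}.
    cov = sigma2 * ZX_inv @ (Z.T @ Z) @ ZX_inv.T
    return b, cov


# Simulated data (all values are illustrative assumptions).
rng = np.random.default_rng(1)
n = 100_000
w = rng.normal(size=n)                      # measured without error
x_star = 0.5 * w + rng.normal(size=n)       # true regressor, unobserved
x_obs = x_star + rng.normal(size=n)         # observed with measurement error
z = x_star + rng.normal(size=n)             # instrument for x_obs
y = 1.0 + 0.5 * w + 2.0 * x_star + rng.normal(size=n)

X = np.column_stack([np.ones(n), w, x_obs])
Z = np.column_stack([np.ones(n), w, z])     # constant and w instrument themselves

b_iv, cov_iv = iv_estimate(y, X, Z)
std_err = np.sqrt(np.diag(cov_iv))
print("coefficients:   ", np.round(b_iv, 3))
print("standard errors:", np.round(std_err, 4))
```

Solving the linear system for the coefficients avoids forming an explicit inverse there; the inverse is computed only because the covariance formula needs it on both sides.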

11Use (A-66) to invert $[\mathbf{Q}^* + \boldsymbol{\Sigma}_{uu}] = [\mathbf{Q}^* + (\sigma_u \mathbf{e}_1)(\sigma_u \mathbf{e}_1)']$, where $\mathbf{e}_1$ is the first column of a $K \times K$ identity matrix. The remaining results are then straightforward.

12This point is important to remember when the presence of measurement error is suspected.

13Some firm analytic results have been obtained by Levi (1973), Theil (1961), Klepper and Leamer (1983), Garber and Klepper (1980), Griliches (1986), and Cragg (1997).

5.6.3 PROXY VARIABLES

In some situations, a variable in a model simply has no observable counterpart. Education, intelligence, ability, and like factors are perhaps the most common examples. In this instance, unless there is some observable indicator for the variable, the model will have to be treated in the framework of missing variables. Usually, however, such an indicator can be obtained; for the factors just given, years of schooling and test scores of various sorts are familiar examples. The usual treatment of such variables is in the measurement error framework. If, for example, $\text{income} = \beta_1 + \beta_2\,\text{education} + \varepsilon$ and $\text{years of schooling} = \text{education} + u$, then the model of Section 5.6.1 applies. The only difference here is that the true variable in the model is "latent." No amount of improvement in reporting or measurement would bring the proxy closer to the variable for which it is proxying.

The preceding is a pessimistic assessment, perhaps more so than necessary. Consider a structural model,

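$$\text{Earnings} = \beta_1 + \beta_2\,\text{Experience} + \beta_3\,\text{Industry} + \beta_4\,\text{Ability} + \varepsilon.$$

Ability is unobserved, but suppose that an indicator, say IQ, is. If we suppose that IQ is related to Ability through a relationship such as

$$\text{IQ} = \alpha_1 + \alpha_2\,\text{Ability} + v,$$

then we may solve the second equation for Ability and insert it in the first to obtain the reduced form equation

$$\text{Earnings} = (\beta_1 - \beta_4\alpha_1/\alpha_2) + \beta_2\,\text{Experience} + \beta_3\,\text{Industry} + (\beta_4/\alpha_2)\,\text{IQ} + (\varepsilon - v\beta_4/\alpha_2).$$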

This equation is intrinsically linear and can be estimated by least squares. We do not have a consistent estimator of $\beta_1$ or $\beta_4$, but we do have consistent estimators of the coefficients of interest, those on Experience and Industry. This would appear to "solve" the problem. We should note the essential ingredients: we require that the indicator, IQ, not be related to the other variables in the model, and we also require that $v$ not be correlated with any of the variables. In this instance, some of the parameters of the structural model are identified in terms of observable data. Note, though, that IQ is not a proxy variable; it is an indicator of the latent variable, Ability. This form of modeling has figured prominently in the education and educational psychology literature. Consider, in the preceding small model, how one might proceed with not just a single indicator, but, say, with a battery of test scores, all of which are indicators of the same latent ability variable.
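A small simulation can make the indicator regression concrete. Everything here is an illustrative assumption: the parameter values, the distributions, and in particular the construction of Ability (and hence IQ) as independent of Experience and Industry, which is the condition the text requires.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Illustrative structural parameters (assumptions, not estimates).
b1, b2, b3, b4 = 1.0, 0.5, -0.3, 2.0
a1, a2 = 100.0, 15.0

experience = rng.normal(10.0, 3.0, n)
industry = rng.integers(0, 2, n).astype(float)   # binary industry indicator
ability = rng.normal(0.0, 1.0, n)                # latent; independent of the others
v = rng.normal(0.0, 5.0, n)
eps = rng.normal(0.0, 1.0, n)

iq = a1 + a2 * ability + v
earnings = b1 + b2 * experience + b3 * industry + b4 * ability + eps

# Least squares of Earnings on a constant, Experience, Industry, and IQ.
X = np.column_stack([np.ones(n), experience, industry, iq])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)

print(f"Experience: {coef[1]:.3f} (true {b2})")
print(f"Industry:   {coef[2]:.3f} (true {b3})")
# The IQ coefficient is printed for reference only. IQ contains v, which also
# appears in the composite disturbance, so it need not equal b4 / a2.
print(f"IQ:         {coef[3]:.4f} (b4/a2 = {b4 / a2:.4f})")
```

In this setup the Experience and Industry coefficients should land close to their structural values even though Ability itself never enters the regression.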

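It is to be emphasized that a proxy variable is not an instrument (or the reverse). Thus, in the instrumental variables framework, it is implied that we do not regress $\mathbf{y}$ on $\mathbf{Z}$ to obtain the estimates. To take an extreme example, suppose that the full model was

$$\mathbf{y} = \mathbf{X}^*\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad \mathbf{X} = \mathbf{X}^* + \mathbf{U}, \quad \mathbf{Z} = \mathbf{X}^* + \mathbf{W}.$$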

That is, we happen to have two badly measured estimates of X*. The parameters of this model can be estimated without difficulty if W is uncorrelated with U and X*, but not by regressing y on Z. The instrumental variables technique is called for.
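The contrast between regressing on the second measure and using it as an instrument can be sketched with a single regressor. The setup is an assumption for illustration: `X` and `Z` are two noisy measures of the same `x_star`, with mutually independent errors of unit variance, and all variables have mean zero so no constant is needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta = 1.5                               # illustrative true coefficient (assumption)

x_star = rng.normal(size=n)              # true regressor, unobserved
X = x_star + rng.normal(size=n)          # first noisy measure  (error U)
Z = x_star + rng.normal(size=n)          # second noisy measure (error W)
y = beta * x_star + rng.normal(size=n)

# Regressing y on Z treats Z as a proxy: the slope is attenuated.
# With Var[x*] = Var[W] = 1, its probability limit is beta / 2.
b_y_on_Z = (Z @ y) / (Z @ Z)

# Using Z as an instrument for X: (Z'X)^{-1} Z'y.
b_iv = (Z @ y) / (Z @ X)

print(f"y on Z: {b_y_on_Z:.3f}   IV with Z for X: {b_iv:.3f}   true: {beta}")
```

The same variable `Z` gives an attenuated slope when used as a regressor and a slope near the true value when used as an instrument.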

When the model contains a variable such as education or ability, the question that naturally arises is, If interest centers on the other coefficients in the model, why not just discard the problem variable?14 This method produces the familiar problem of an omitted variable, compounded by the least squares estimator in the full model being inconsistent anyway. Which estimator is worse? McCallum (1972) and Wickens (1972) show that the asymptotic bias (actually, degree of inconsistency) is worse if the proxy is omitted, even if it is a bad one (has a high proportion of measurement error). This proposition neglects, however, the precision of the estimates. Aigner (1974) analyzed this aspect of the problem and found, as might be expected, that it could go either way. He concluded, however, that "there is evidence to broadly support use of the proxy."

5.6.4 APPLICATION: INCOME AND EDUCATION AND A STUDY OF TWINS

The traditional model used in labor economics to study the effect of education on income is an equation of the form

$$y_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{age}_i^2 + \beta_4\,\text{education}_i + \mathbf{x}_i'\boldsymbol{\beta}_5 + \varepsilon_i,$$

where $y_i$ is typically a wage or yearly income (perhaps in log form) and $\mathbf{x}_i$ contains other variables, such as an indicator for sex, region of the country, and industry. The literature contains discussion of many possible problems in estimation of such an equation by least squares using measured data. Two of them are of interest here:

1. Although "education" is the variable that appears in the equation, the data available to researchers usually include only "years of schooling." This variable is a proxy for education, so an equation fit in this form will be tainted by this problem of measurement error. Perhaps surprisingly so, researchers also find that reported data on years of schooling are themselves subject to error, so there is a second source of measurement error. For the present, we will not consider the first (much more difficult) problem.

2. Other variables, such as "ability" (we denote these $\mu_i$), will also affect income and are surely correlated with education. If the earnings equation is estimated in the form shown above, then the estimates will be further biased by the absence of this "omitted variable." For reasons we will explore in Chapter 22, this bias has been called the selectivity effect in recent studies.

Simple cross-section studies will be considerably hampered by these problems. But, in a recent study, Ashenfelter and Krueger (1994) analyzed a data set that allowed them, with a few simple assumptions, to ameliorate these problems.

14This discussion applies to the measurement error and latent variable problems equally.

Annual "twins festivals" are held at many places in the United States. The largest is held in Twinsburg, Ohio. The authors interviewed about 500 individuals over the age of 18 at the August 1991 festival. Using pairs of twins as their observations enabled them to modify their model as follows: Let $(y_{ij}, A_{ij})$ denote the earnings and age for twin $j$, $j = 1, 2$, for pair $i$. For the education variable, only self-reported "schooling" data, $S_{ij}$, are available. The authors approached the measurement problem in the schooling variable, $S_{ij}$, by asking each twin how much schooling they had and how much schooling their sibling had. Denote schooling reported by sibling $m$ of sibling $j$ by $S_{ij}(m)$. So, the self-reported years of schooling of twin 1 is $S_{i1}(1)$. When asked how much schooling twin 1 has, twin 2 reports $S_{i1}(2)$. The measurement error model for the schooling variable is

$$S_{ij}(m) = S_{ij} + u_{ij}(m), \quad j, m = 1, 2,$$

where $S_{ij}$ = "true" schooling for twin $j$ of pair $i$.

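We assume that the two sources of measurement error, $u_{ij}(m)$, are uncorrelated and have zero means. Now, consider a simple bivariate model such as the one in (5-25):

$$y_{ij} = \beta S_{ij} + \varepsilon_{ij}.$$

As we saw earlier, a least squares estimate of $\beta$ using the reported data will be attenuated:

$$\operatorname{plim} b = \beta \times \frac{\operatorname{Var}[S_{ij}]}{\operatorname{Var}[S_{ij}] + \operatorname{Var}[u_{ij}(j)]} = \beta q.$$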

(Since there is no natural distinction between twin 1 and twin 2, the assumption that the variances of the two measurement errors are equal is innocuous.) The factor $q$ is sometimes called the reliability ratio. In this simple model, if the reliability ratio were known, then $\beta$ could be consistently estimated. In fact, the construction of this model allows just that. Since the two measurement errors are uncorrelated,