Unbiased Analysis of Today's Healthcare Issues

Hausman Endogeneity Test

Written By: Jason Shafrin - Oct• 30•06

Ordinary Least Squares 

If you have studied basic statistics, it's likely that you have come across the ordinary least squares (OLS) estimation technique.  OLS minimizes the sum of squared distances between the observed values of the dependent variable ('y') and a linear prediction of y (y_hat=xβ).  The parameter vector 'β_ols' is the value of β that minimizes this sum.  For β_ols to reflect the true parameters in the population, the most important assumption is that the regressors are uncorrelated with the error term (cov(x,e)=0).  Sometimes this is not the case.  The assumption fails if:

  1. There are omitted variables which are correlated with the regressors (x).
  2. We have a system of simultaneous equations.
  3. There is an errors-in-variables (measurement error) problem.
  4. The system has a lagged dependent variable with a serially correlated disturbance.
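
To make the first case concrete, here is a minimal simulation sketch. The data-generating process, the variable names (w, x, y) and the helper ols are all assumptions made for illustration and do not come from any real dataset. The true coefficient on x is set to 1.0, and w is an omitted variable that is correlated with x.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients for y = X @ beta + e via least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical data-generating process (an assumption for illustration).
w = rng.normal(size=n)                       # variable we will omit
x = 0.8 * w + rng.normal(size=n)             # regressor correlated with w
y = 1.0 * x + 1.0 * w + rng.normal(size=n)   # true coefficient on x is 1.0

const = np.ones(n)
beta_short = ols(np.column_stack([const, x]), y)     # w omitted, so cov(x,e) != 0
beta_long = ols(np.column_stack([const, x, w]), y)   # w included

print("slope on x with w omitted: ", beta_short[1])
print("slope on x with w included:", beta_long[1])
```

In this setup the population slope from the short regression is 1 + cov(x,w)/var(x) = 1 + 0.8/1.64, or about 1.49, so the estimate with w omitted should settle well above the true value of 1.0, while the regression that includes w should recover it.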

Instrumental Variables

One solution to these problems is to use an instrumental variables (IV) technique.  A question remains as to when OLS is appropriate and when IV is preferable.  OLS will generally give smaller standard errors (and thus is more precise), so it is to be preferred whenever the regressors are exogenous and the β_OLS estimates are therefore consistent.
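
The precision claim can be illustrated with a small Monte Carlo sketch. Everything here is an assumption chosen for illustration: a single regressor x that is exogenous by construction, a single instrument z, and a true slope of 2.0. With one regressor and one instrument, the IV slope reduces to cov(z,y)/cov(z,x).

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 2000
ols_slopes, iv_slopes = [], []

for _ in range(reps):
    z = rng.normal(size=n)             # instrument
    x = 0.5 * z + rng.normal(size=n)   # regressor, exogenous in this setup
    e = rng.normal(size=n)             # error drawn independently of x and z
    y = 2.0 * x + e                    # true slope is 2.0

    # OLS slope: cov(x, y) / var(x)
    ols_slopes.append(np.cov(x, y)[0, 1] / np.var(x, ddof=1))
    # Simple IV slope: cov(z, y) / cov(z, x)
    iv_slopes.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

sd_ols = np.std(ols_slopes)
sd_iv = np.std(iv_slopes)
print("mean OLS slope:", np.mean(ols_slopes), " sampling sd:", sd_ols)
print("mean IV slope: ", np.mean(iv_slopes), " sampling sd:", sd_iv)
```

Both estimators should center near 2.0 here because x is exogenous, but the IV estimates should be noticeably more spread out. That loss of precision is the cost of using IV when it is not needed.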

Hausman Endogeneity Test

To test whether the IV or OLS regression technique is best, one can use the Hausman endogeneity test.  Let us try to estimate the following equation:

  • (1) y1 = x1*δ + y2*α + e

Let the vector z=(x1,x2) be the set of all exogenous variables.  The vector x1 contains the exogenous regressors that appear in equation 1, and x2 contains our instruments, which are excluded from equation 1.  Since z is exogenous, we know E(z′*e)=0.  We believe the variable y2 to be endogenous.

One example of an endogenous y2 would be a wage equation where y1 is the individual’s wage and y2 is the number of hours worked.  We would think that full-time workers earn more than part-time workers, so hours would affect the wage.  On the other hand, when a worker’s wage is higher, one would expect the individual to work more hours (assuming the substitution effect outweighs the income effect).  In this example causation runs in both directions.

To conduct the Hausman test, we first find the linear projection of y2 on z using OLS.

  • (2) y2 = z*π + v

Since the error term from the first equation ('e') is uncorrelated with z by assumption, y2 is endogenous if and only if E(v*e)≠0.  We can test whether the structural error 'e' is correlated with the reduced form error 'v' using the following equation:

  • (3) e = ρ*v + u

Plugging equation 3 into equation 1, we have:

  • (4) y1 = x1*δ + y2*α + ρ*v + u

In empirical data, however, 'v' is not observed.  Nevertheless, we can obtain 'v_hat', an estimate of 'v', by saving the residuals from our OLS regression in equation 2 and substituting them for 'v' in equation 4.  The final equation is:

  • (5) y1 = x1*δ + y2*α + ρ*(v_hat) + u

We can now consistently estimate δ, α, and ρ using OLS.  Using the usual OLS t-statistic, we can test the null hypothesis that ρ=0.  If we fail to reject the null, then there is no evidence of an endogeneity problem and one should use an OLS estimation strategy.  If we reject the null in favor of ρ≠0, then the instrumental variables technique is best.  One can also use a heteroskedasticity-robust t-statistic for testing ρ if one suspects heteroskedasticity.
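
The procedure in equations 2 through 5 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names ols_fit and hausman_t, the simulated data and the parameter values are all assumptions, and the t-statistic uses the conventional (non-robust) OLS standard error.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients, conventional standard errors, and residuals."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # error variance estimate
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return beta, se, resid

def hausman_t(y1, x1, y2, x2):
    """t-statistic on the first-stage residual v_hat in equation 5."""
    z = np.column_stack([x1, x2])             # all exogenous variables
    _, _, v_hat = ols_fit(z, y2)              # equation 2: residuals of y2 on z
    X_aug = np.column_stack([x1, y2, v_hat])  # equation 5 regressors
    beta, se, _ = ols_fit(X_aug, y1)
    return beta[-1] / se[-1]                  # v_hat is the last column

# Hypothetical simulated data (an assumption for illustration).
rng = np.random.default_rng(2)
n = 5000
x1 = np.ones(n)                # the only included exogenous regressor: a constant
x2 = rng.normal(size=n)        # excluded instrument
v = rng.normal(size=n)         # reduced form error
u = rng.normal(size=n)
y2 = 1.0 + x2 + v

y1_endog = 1.0 + 2.0 * y2 + (0.5 * v + u)   # e = 0.5*v + u, so y2 is endogenous
y1_exog = 1.0 + 2.0 * y2 + u                # e = u, so y2 is exogenous

t_endog = hausman_t(y1_endog, x1, y2, x2)
t_exog = hausman_t(y1_exog, x1, y2, x2)
print("t-statistic, endogenous y2:", t_endog)
print("t-statistic, exogenous y2: ", t_exog)
```

In the endogenous case the t-statistic should be far beyond the usual large-sample 5% critical value of about 1.96, while in the exogenous case it should behave like a draw from a standard normal distribution.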

A similar set of procedures can be extended to the case where y2 is a vector.  Instead of a t-test on the single residual 'v_hat', in the vector case we would have to perform an F-test of the joint hypothesis ρ=0 on the vector of residuals 'v_hat'.  To see how to perform Hausman tests in the Stata statistical package, see the paper by Baum, et al.
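
For the vector case, one way to sketch the joint test is to compare the sum of squared residuals with and without the first-stage residuals. The function names ssr and hausman_f, the simulated data and the coefficient values below are assumptions for illustration, and the F-statistic is the conventional homoskedastic version.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

def hausman_f(y1, x1, Y2, x2):
    """F-statistic for the joint null that all first-stage residuals have zero coefficients."""
    z = np.column_stack([x1, x2])
    pi_hat, *_ = np.linalg.lstsq(z, Y2, rcond=None)   # one first stage per column of Y2
    V_hat = Y2 - z @ pi_hat                           # matrix of first-stage residuals
    X_r = np.column_stack([x1, Y2])                   # restricted: residuals excluded
    X_u = np.column_stack([x1, Y2, V_hat])            # unrestricted: residuals included
    q = V_hat.shape[1]                                # number of restrictions
    n, k = X_u.shape
    ssr_r, ssr_u = ssr(X_r, y1), ssr(X_u, y1)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))

# Hypothetical simulated data with two endogenous variables and two instruments.
rng = np.random.default_rng(3)
n = 5000
x1 = np.ones(n)
x2 = rng.normal(size=(n, 2))
V = rng.normal(size=(n, 2))
u = rng.normal(size=n)
Y2 = x2 @ np.array([[1.0, 0.5], [0.5, 1.0]]) + V

structural = 1.0 + Y2 @ np.array([1.0, 2.0])
y1_endog = structural + V @ np.array([0.5, -0.5]) + u   # error correlated with V
y1_exog = structural + u                                # error unrelated to V

f_endog = hausman_f(y1_endog, x1, Y2, x2)
f_exog = hausman_f(y1_exog, x1, Y2, x2)
print("F-statistic, endogenous Y2:", f_endog)
print("F-statistic, exogenous Y2: ", f_exog)
```

The statistic is compared with an F distribution with q and n-k degrees of freedom. Note that the sketch needs at least as many instruments in x2 as there are endogenous variables in Y2, otherwise the augmented regression is not identified.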


In summary, the test proceeds as follows:

  1. Perform a first stage regression of the endogenous variable (y2) on z.
  2. Calculate the residuals from this equation and include them as an additional regressor in the original estimation equation.
  3. Run OLS on this new equation and perform a t-test for the coefficient on the first stage residuals.
  4. If one fails to reject the null hypothesis, then there is no evidence of an endogeneity problem and OLS should be used.  If one rejects the null hypothesis, then endogeneity is a problem and one should use an IV estimation strategy.


Wooldridge, Jeffrey; Econometric Analysis of Cross Section and Panel Data, MIT Press, London, (c) 2002, pp. 118-122.  


One Comment

  1. vb says:

    This is a really nice and clear explanation!