Econometrics


Nobel laureate James Heckman has a nice summary of how applied econometricians and policy researchers should define causality. Some of the more interesting points I have excerpted below.

On the source of randomness in a sample

One reason why many statistical models are incomplete is that they do not specify the sources of randomness generating variability among agents, i.e., they do not specify why otherwise observationally identical people make different choices and have different outcomes given the same choice. They do not distinguish what is in the agent’s information set from what is in the observing statistician’s information set, although the distinction is fundamental in justifying the properties of any estimator for solving selection and evaluation problems. They do not distinguish uncertainty from the point of view of the agent whose behavior is being analyzed from variability as analyzed by the observing analyst. They are also incomplete because they are recursive. They do not allow for simultaneity in choices of outcomes of treatment that are at the heart of game theory and models of social interactions and contagion (see, e.g., Brock & Durlauf, 2001; Tamer, 2003).

Unbundling a treatment

Researchers often say that a policy change will cause a change in some outcome measure. However, a policy change is often made up of many components. Which components of the policy change actually influenced the outcomes? In Heckman’s words:

Many causal models in statistics are black-box devices designed to investigate the impact of “treatments”—often complex packages of interventions—on observed outcomes in a given environment. Unbundling the components of complex treatments is rarely done. Explicit scientific models go into the black box to explore the mechanism(s) producing the effects.

Outcomes vs. Utilities

Most researchers pick an outcome variable of interest and assume that if the outcome increases (for a beneficial outcome measure) then people are better off. This may not be the case, however. For instance, Bill Clinton’s welfare reform act (PRWORA) may have increased employment rates and income for single mothers, but the mothers’ utility may have decreased. The single mothers may (or may not) have valued spending time caring for their children more than working.

Problems with non-linearity

Issues such as “social interactions, contagion and general equilibrium effects” can complicate causal inference.

What are you measuring?

Let us assume that Y is the outcome variable of interest. Y depends on what state, s, you are in. For instance, in a treatment/no treatment world, Y(s) is the outcome if you were treated and Y(s’) is the outcome if you were not treated. D(s)=1 if you were actually treated and D(s)=0 if you did not receive treatment in the data. Thus, we can measure various things:

  • Average Treatment Effect (ATE): $E[Y(s) - Y(s')]$. This is equal to the average effect if all individuals moved from an untreated to a treated state.
  • Treatment on the Treated (TT): $E[Y(s) - Y(s') \mid D(s) = 1]$. This looks at the average effect of treatment only on those who were treated. This is important if only certain individuals select into the treatment group, or if the policy change is only relevant for certain individuals.
  • Treatment on the Untreated (TUT): $E[Y(s) - Y(s') \mid D(s) = 0]$. This is the average effect treatment would have had on those who were not treated. Separately, treatment can also spill over onto those who are not treated. For instance, instituting a work training program for treated individuals may reduce community college enrollment and thus may affect untreated individuals (e.g., if the community college closes from lack of enrollment).
  • Policy relevant treatment effect (PRTE): $E_p[Y(s)] - E_{p'}[Y(s)]$. The estimator compares the average outcomes of two different policy choices, p and p'.
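
To make the difference between these parameters concrete, here is a minimal simulation sketch. Everything in it is an assumption of mine for illustration (the distributions, the selection rule, the sample size); it is not taken from Heckman's paper. People with larger gains are more likely to select into treatment, so TT ends up above ATE and TUT below it.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical population: each person has both potential outcomes, and
# people with larger gains are more likely to select into treatment.
people = []
for _ in range(100_000):
    gain = random.gauss(1.0, 1.0)              # Y(s) - Y(s') for this person
    y_untreated = random.gauss(10.0, 2.0)      # Y(s')
    y_treated = y_untreated + gain             # Y(s)
    treated = gain + random.gauss(0.0, 1.0) > 1.0  # D(s): selection on gains
    people.append((y_treated, y_untreated, treated))

ate = fmean(yt - yu for yt, yu, _ in people)
tt = fmean(yt - yu for yt, yu, d in people if d)
tut = fmean(yt - yu for yt, yu, d in people if not d)
share_treated = fmean(1.0 if d else 0.0 for _, _, d in people)

print(f"ATE={ate:.2f}  TT={tt:.2f}  TUT={tut:.2f}")
```

In real data only one of the two potential outcomes is observed per person, so none of these averages can be computed this directly; the simulation only shows what each parameter is averaging over. ATE is the share-weighted average of TT and TUT.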

Heckman, James (2008) “Econometric Causality” NBER WP #13934.

Much of health care data is characterized by a large cluster of observations at 0 and a right-skewed distribution of the remaining outcomes. For instance, people who do not get sick generally use $0 of medical care. Those who do get sick use a varying amount of medical care dollars, and there are a large number of outliers with extremely expensive medical care. How do health economists take these features into account?

David Madden looks at two alternatives to correct for the shape of the distribution in his 2008 JHE paper: sample selection and two-part models. Zero consumption of a good can result from two different decisions: a participation decision and a consumption decision. For instance, in the case of smoking, individuals may decide not to smoke no matter how cheap cigarettes get (participation decision). On the other hand, some smokers may decide not to smoke during a given time period because cigarettes are very expensive or they have low income (consumption decision). Since people cannot smoke a negative number of cigarettes, there will be a cluster of observations at zero.

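Assume that an individual’s utility from participation is $w = \alpha'Z + v$. If $w > 0$, then $d = 1$ (the individual participates), and if $w \le 0$, then $d = 0$ (the individual does not participate). For consumption, individuals will choose $y^{**} = \max[0, y^*]$, where $y^* = \beta'X + u$. A general model can be written as follows, where the first product runs over observations with zero consumption and the second over observations with positive consumption:

  • $L_0 = \prod_{0} [1 - P(v > -\alpha'Z)\, P(u > -\beta'X \mid v > -\alpha'Z)] \prod_{+} P(v > -\alpha'Z)\, P(u > -\beta'X \mid v > -\alpha'Z)\, g(y \mid v > -\alpha'Z,\, u > -\beta'X)$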

If u and v are independent, then we have the Cragg model:

  • $L_1 = \prod_{0} [1 - P(v > -\alpha'Z)\, P(u > -\beta'X)] \prod_{+} P(v > -\alpha'Z)\, P(u > -\beta'X)\, g(y \mid u > -\beta'X)$

If we assume that the participation constraint dominates the consumption constraint (which is likely in the smoking example, but maybe not for drinking), then we have P(y*>0|d=1)=1 and g(y*|y*>0,d=1)=g(y*|d=1). This means that if you are a smoker you will have at least one cigarette per period. When the participation constraint dominates, we ignore the consumption decision and we have the following likelihood function which corresponds to the Heckman Selection model.

  • $L_2 = \prod_{0} [1 - P(v > -\alpha'Z)] \prod_{+} P(v > -\alpha'Z)\, g(y \mid v > -\alpha'Z)$

If independence is assumed, then we are left with probit for participation and OLS for consumption. This is the two part model:

  • $L_3 = \prod_{0} [1 - P(v > -\alpha'Z)] \prod_{+} P(v > -\alpha'Z)\, g(y)$

Which of these models works best empirically?

Results

Madden looks at the fit of regressions that model smoking and drinking behavior using a wide variety of covariates. In general, the two-part model seems to perform better in the data used for this study, but the author wisely notes that choosing between the Heckman selection model and the two-part model should be done on a case-by-case basis.

Let us assume that there are two types of people: smart people and dumb people. Smart people’s test scores are normally distributed about 80% and dumb people’s test scores are normally distributed about 40%. If we observe the test score of one person, how do we know if they are smart or dumb? If we see a score of 85%, we are pretty sure they are smart. A dumb person might have had a good day, but this would be a low-probability event. Similarly, if we saw a score of 35%, we would be fairly certain that the person is dumb, even though there is a small probability that a smart person may have had a bad day. If we see a score of 62%, however, then it is very difficult to tell whether the person is smart or dumb. How can we quantify the probability that a person is of a certain type?

One way of doing this is finite mixture models. Jim Hamilton’s Time Series Analysis book has a good explanation of this topic and I will review this material here.

Each type (e.g., how smart the person is) will be designated as $s_t = 1, 2, \ldots, N$. Let us assume that there is an observed variable $y_t$ (e.g., the test score) which, for a person of type $j$, is distributed $N(\mu_j, \sigma_j^2)$. What researchers want to know is: given that we observe $y_t$, what is the probability that the observation is from a person of type $s_t = j$?

Let us assume that we know the density of $y_t$ conditional on the type is:

  • $f(y_t \mid s_t=j;\theta) = (2\pi\sigma_j^2)^{-1/2} \exp\{-(y_t - \mu_j)^2 / (2\sigma_j^2)\}$

There is also some underlying distribution of types.

  • $P(s_t=j;\theta) = \lambda_j$
  • $\theta = (\mu_1, \ldots, \mu_N, \sigma_1^2, \ldots, \sigma_N^2, \lambda_1, \ldots, \lambda_N)$

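From the definition of conditional probability (the building block of Bayes’ rule), we know that:

  • $P(A \text{ and } B) = P(A \mid B)\,P(B)$, which implies
  • $f(y_t, s_t=j;\theta) = \lambda_j (2\pi\sigma_j^2)^{-1/2} \exp\{-(y_t - \mu_j)^2 / (2\sigma_j^2)\}$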

The unconditional density can be found as follows:

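  • $f(y_t;\theta) = \sum_{j=1}^{N} f(y_t, s_t=j;\theta)$
  • $f(y_t;\theta) = \lambda_1 (2\pi\sigma_1^2)^{-1/2} \exp\{-(y_t - \mu_1)^2 / (2\sigma_1^2)\} + \ldots + \lambda_N (2\pi\sigma_N^2)^{-1/2} \exp\{-(y_t - \mu_N)^2 / (2\sigma_N^2)\}$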

Now we can use maximum likelihood estimation techniques to find the θ which will maximize:

  • $\max_{\theta} L(\theta) = \sum_{t=1}^{T} \log f(y_t;\theta)$
  • s.t. $\lambda_1 + \lambda_2 + \ldots + \lambda_N = 1$
  • s.t. $\lambda_j \ge 0$ for all $j$

Once we have the maximum likelihood estimate of θ, we can work out the probability that observation $y_t$ came from a person of type $s_t=j$. Using Bayes’ rule again, we know that:

  • $P(s_t=j \mid y_t;\theta) = f(y_t, s_t=j;\theta) / f(y_t;\theta) = \lambda_j f(y_t \mid s_t=j;\theta) / f(y_t;\theta)$

This value represents the probability, given the observed data, that the unobserved type responsible for observation t was type j. For example, “…if an observation $y_t=0$, one could be virtually certain that the observation had come from a N(0,1) distribution rather than a N(4,1) distribution, so that $P(s_t=1 \mid y_t;\theta)$ for that date would be near unity. If instead $y_t$ were around 2.3, it is equally likely that the observation might have come from either regime so that $P(s_t=1 \mid y_t;\theta)$ for such an observation would be close to 0.5.”
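
The posterior type probability is easy to compute once θ is in hand. The sketch below assumes θ is already known rather than estimated, and the mixing weights of 0.8 and 0.2 are my own assumption (the quoted passage does not state them); the function name is hypothetical.

```python
from statistics import NormalDist

def posterior_type_probs(y, means, sds, weights):
    """Return P(s_t = j | y_t) for each type j via Bayes' rule."""
    # Joint density f(y_t, s_t = j) = lambda_j * f(y_t | s_t = j)
    joint = [w * NormalDist(m, s).pdf(y) for m, s, w in zip(means, sds, weights)]
    total = sum(joint)  # unconditional density f(y_t; theta)
    return [j / total for j in joint]

# Assumed parameters: N(0,1) and N(4,1) regimes with weights 0.8 and 0.2.
means, sds, weights = [0.0, 4.0], [1.0, 1.0], [0.8, 0.2]
for y in (0.0, 2.3, 4.0):
    p1 = posterior_type_probs(y, means, sds, weights)[0]
    print(f"y={y}: P(s=1|y)={p1:.3f}")
```

With these assumed weights, an observation at 0 is assigned to the first regime with probability very close to one, while an observation around 2.3 is close to a coin flip.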

Most of the above content is from:

  • James D. Hamilton (1994) Time Series Analysis, Princeton University Press, Princeton, NJ; pp. 685-689.

What is the effect of a country’s GDP on health? What about the effect of a country’s literacy rate on infant mortality rates? Often researchers try to answer these questions using time-series data. With data of this kind, we have observations of a few units (e.g., countries or individuals) over many years.

Let the subscript i represent the individual or country and the subscript t indicate the year. We can have a regression framework as follows:

  • $y_{it} = \beta x_{it} + \varepsilon_{it}$

As long as $\mathrm{cov}(x_{it}, \varepsilon_{it}) = 0$, ordinary least squares (OLS) will provide an unbiased estimate of β.

One frequent problem which occurs with time series data is that there will be serial correlation. Serial correlation (or autocorrelation) occurs when the error terms are correlated over time. For instance,

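  • $\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + \eta_{it}$, where $\eta_{it}$ is an innovation that is uncorrelated over time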

Serial correlation means that if your predicted y value is overestimated in one period, it is likely to be overestimated in another period. This is likely due to some persistent variable omitted from the regression. For instance, if we regressed test scores on a vector of explanatory variables, it is likely that a student who scored higher than their predicted test score in one period would also score higher than their predicted test score in another period.

Fortunately, our coefficient estimate (β) is still unbiased even in the presence of serial correlation. However, OLS is inefficient, and the conventional standard errors are biased; with positive serial correlation they are typically too small.

One way to test for serial correlation is to use the Durbin-Watson test. Let $u_{it}$ be the residuals after we conduct an OLS regression ($u_{it} = y_{it} - \beta_{ols} x_{it}$).

The Durbin Watson statistic is:

  • $d = [\sum_{t=2}^{T} (u_{it} - u_{i,t-1})^2] / [\sum_{t=1}^{T} u_{it}^2]$

With panel data we have:

  • $d = [\sum_{i=1}^{N} \sum_{t=2}^{T} (u_{it} - u_{i,t-1})^2] / [\sum_{i=1}^{N} \sum_{t=1}^{T} u_{it}^2]$

This page will help you interpret the statistic and decide whether or not to reject the null hypothesis of no serial correlation. If there is serial correlation in your data, you may want to include a lagged dependent variable as one of your right-hand-side variables. This will result in an AR(1) specification.
Yuting Wang of Notre Dame has a good explanation of the problems that occur with serial correlation.
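
The Durbin-Watson statistic for a single series takes only a few lines to compute. In the sketch below I feed simulated errors straight into the formula instead of residuals from an actual regression, and the helper names and parameter values are my own. A handy rule of thumb is that d is roughly 2(1 − ρ), so values near 2 suggest no first-order serial correlation and values well below 2 suggest positive serial correlation.

```python
import random

def durbin_watson(residuals):
    """Durbin-Watson d for one series of residuals ordered in time."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(u * u for u in residuals)
    return num / den

def simulate_ar1(rho, n, rng):
    """Hypothetical AR(1) errors: e_t = rho * e_{t-1} + innovation."""
    errors, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, 1.0)
        errors.append(prev)
    return errors

rng = random.Random(0)
d_white = durbin_watson(simulate_ar1(0.0, 5000, rng))  # no serial correlation
d_ar = durbin_watson(simulate_ar1(0.8, 5000, rng))     # strong positive correlation
print(f"d (rho=0.0) = {d_white:.2f}, d (rho=0.8) = {d_ar:.2f}")
```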

Most public health officials believe that increasing the supply of primary care doctors is almost always a good thing, while increasing the number of specialists can have mixed results. One problem is that physician supply is endogenous. One may believe that physicians prefer to locate in wealthier areas. If wealthier people are also healthier, then a correlation will exist between physician supply and health even though no causality exists.

In order to isolate the direct causal effect of increasing family physician supply, Gravelle, Morris and Sutton (2008) use an instrumental variables methodology. The two instruments for physician supply are an index of local area housing prices and average age-related capitation payments. Since physicians’ location decisions are regulated by the Medical Practices Committee and do not include a cost-of-living adjustment, we would expect lower physician supply where housing prices are higher. Local area average capitation payments should not affect any individual’s health, but should attract increased family physician supply.

These instruments are implemented on the Health Survey for England data set. Physician supply comes from the General Medical Services (GMS) Statistics database.

Health levels are measured as very good, good, fair, bad, or very bad. In this case, an ordered probit regression is used. The authors also use the EQ-5D continuous scale health measure. With the continuous variable, a least squares regression model is used. What are the results?

When no instruments are used FPs [family physicians] have a positive but statistically insignificant effect on health. When FP supply is instrumented by age-related capitation it has markedly larger and statistically significant effects. A 10 percent increase in FP supply increases the probability of reporting very good health by 6 percent.

Since almost all medical care and pharmaceuticals are free to patients, increased physician supply will not act to reduce prices. Nevertheless, more family physicians can make going to the doctor more convenient and can reduce waiting times, thus increasing the number of family physician visits per individual per year.

One interesting econometric technique used in this paper is that of the anti-test. A paper by Dranove and Wehner (1994) criticizes the use of instrumental variables because some instruments can be used to “prove” that increased physician supply “causes” increased childbirth. This is obviously a nonsensical result. In this paper, the authors use instrumented and noninstrumented family physician supply to see whether these variables have any effect on the individual’s ethnicity. Neither the instrumented nor the noninstrumented physician supply has any impact on ethnicity. Thus, we have some indication that the two instruments chosen by the authors are valid.

Randomized clinical trials (RCTs) are the “gold standard” for medical studies. Nevertheless, even RCTs have their problems. An NBER working paper by Ludwig, Marcotte and Norberg highlights some of these issues. The authors examine whether or not anti-depressants reduce suicide rates (they find that anti-depressants do reduce suicide rates).

Unfortunately, using data from RCTs will not give an accurate picture of an anti-depressant’s impact on suicide. For one, RCTs have relatively small sample sizes due to their expense. Since suicide occurs very infrequently, it will be difficult to pick up any statistically significant differences in suicide rates between the treatment and control groups. Secondly, people at high risk for suicide will likely be excluded from the RCT for ethical reasons. Thus, the RCT may have a sample which under-represents individuals with suicidal tendencies.

Traditional instrumental variables (IV) econometric methodologies often fail to take into account response heterogeneity. Response heterogeneity based on characteristics not observed by the researcher can create heterogeneity in the self-selection process. For instance, one group of people who elect to receive surgery may have knowledge of a family history where surgery is typically successful, whereas another group may elect not to receive surgery due to a different family history. If this information is unobservable to the researcher, then an analysis of the average effect of surgery may be biased. In the medical context, traditional IV assumes that:

  1. treatment effects are constant conditional on observed characteristics, or
  2. if treatment effects are heterogeneous, patients or physicians cannot anticipate these effects and use this information to select the most beneficial treatment.

In traditional IV, the treatment parameter gives researchers a local average treatment effect (LATE). But can a researcher characterize a heterogeneous response using IV? A solution to this problem is presented by Basu, Heckman, Navarro-Lozano and Urzua in a 2007 Health Economics paper. They use a local IV to estimate marginal treatment effect (MTE) parameters.

Basic Econometrics Review

Let us assume that a person will have two different outcomes based on whether or not they are treated:

  • $Y_1 = \mu_1(X) + U_1$
  • $Y_0 = \mu_0(X) + U_0$
  • $\Delta = Y_1 - Y_0 = \{\mu_1(X) - \mu_0(X)\} + (U_1 - U_0)$
  • $Y = \mu_0(X) + D\{\mu_1(X) - \mu_0(X)\} + \{D(U_1 - U_0) + U_0\}$

The variable Y1 represents the outcome if the person is treated and Y0 represents the outcome if they are not treated. We only have one observation per person, however, since we cannot observe the counterfactual. If we could observe the counterfactual, Δ would give us the effect of the treatment for each person. Unfortunately we only observe Y. The dummy variable D is equal to unity if the person is in the treatment group and zero otherwise. If there were a randomized trial where people are randomly placed into the treatment and control groups, it would be easy to estimate the treatment effect by comparing the mean outcomes of the treated and control groups. We could examine the mean outcomes for individuals with similar characteristics to determine the treatment parameter by subgroup. However, if individuals can select whether or not to be treated, the error term–which may be composed of unobserved heterogeneity in the effectiveness of the treatment–may be correlated with the regressors that impact the outcome.

The traditional solution to the endogeneity problem is IV. Let X be the set of regressors and Z represent the instruments. “LATE computes the mean gain to those induced to switch from no treatment to treatment by a change in Z from z to z'.”

  • $\mathrm{LATE} = \{E(Y \mid X=x, Z=z') - E(Y \mid X=x, Z=z)\} / \{P(D=1 \mid X=x, Z=z') - P(D=1 \mid X=x, Z=z)\}$
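
The LATE ratio can be checked on simulated data. The sketch below is my own illustration, not from the paper: it drops the covariates X, uses a binary instrument, and assumes three hypothetical groups (always-takers, compliers and never-takers) with different treatment effects. The ratio recovers the effect for compliers only, since they are the ones induced to switch by the instrument.

```python
import random
from statistics import fmean

random.seed(1)
rows = []
for _ in range(200_000):
    z = random.random() < 0.5              # binary instrument
    u = random.random()
    if u < 0.2:                            # always-takers: treated regardless of z
        d, effect = True, 5.0
    elif u < 0.6:                          # compliers: treated only when z is on
        d, effect = z, 2.0
    else:                                  # never-takers: never treated
        d, effect = False, 1.0
    y = random.gauss(0.0, 1.0) + (effect if d else 0.0)
    rows.append((y, d, z))

def mean_by_z(index, z_value):
    """Average of column `index` (0 = Y, 1 = D) among rows with Z == z_value."""
    return fmean(float(r[index]) for r in rows if r[2] == z_value)

late = ((mean_by_z(0, True) - mean_by_z(0, False))
        / (mean_by_z(1, True) - mean_by_z(1, False)))
print(f"LATE estimate = {late:.2f} (assumed complier effect = 2.0)")
```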

Marginal Treatment Effect (MTE)

Developed by Björklund and Moffitt (1987) and furthered by Heckman (1997), the MTE measures “the average gain to patients who are indifferent between receiving treatment 1 [the treatment] versus treatment 0 [the control] given X and Z.” The benefit of using MTE is that one can calculate the marginal treatment effect for different subgroups based on the propensity score. This places a high degree of reliance on the accuracy and precision of the propensity score in order to determine these subgroup treatment parameters.

Let V denote a latent variable which measures the difference in benefits from being in the treated and control groups. Treatment choice can be modeled as follows.

  • $V = \mu_V(Z,X) + U_V$
  • $E(U_V) = 0$
  • $D = 1(V > 0)$

The authors use a propensity score to determine the probability of selecting treatment.

  • $P(z,x) = P(D=1 \mid Z=z, X=x) = P(U_V > -\mu_V(z,x)) = 1 - F_{U_V}(-\mu_V(z,x))$
  • $F_{U_V}(\cdot)$ is the cdf of $U_V$.

Now we can define MTE to be:

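  • $\mathrm{MTE}(x,z) = E(\Delta \mid X=x, Z=z, V=0)$
  • $= E(\Delta \mid X=x, Z=z, U_V = -\mu_V(z,x))$
  • $= \mu_1(x) - \mu_0(x) + E\{U_1 - U_0 \mid U_V = -\mu_V(z,x)\}$
  • $= \mu_1(x) - \mu_0(x) + E\{U_1 - U_0 \mid U_D = F_{U_V}(-\mu_V(z,x))\}$

where $F_{U_V}(U_V) = U_D$. The conditioning event (the part after the ‘|’) in the last equation is a monotonic transformation of the conditioning event in the third equation.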

Local IV (LIV)

The LIV estimates the derivative of the expected outcome conditional on observed characteristics and the probability of electing to be in the treatment group, E(Y|X=x, P(z,x)), with respect to the probability of treatment, P(z,x). The term E(Y|X=x, P(z,x)) is defined as follows:

  • $E(Y \mid X=x, P(z,x)) = E\{DY_1 + (1-D)Y_0 \mid X=x, P(Z,X)=P(z,x)\}$
  • $= \mu_0(x) + P(z,x)\{\mu_1(x) - \mu_0(x)\} + E\{U_0 \mid P(Z,X)=P(z,x)\} + P(z,x)\,E\{U_1 - U_0 \mid P(Z,X)=P(z,x), D=1\}$
  • $= \mu_0(x) + P(z,x)\{\mu_1(x) - \mu_0(x)\} + K(P(z,x))$

The term K(P(z,x)) is a general function of the propensity score, P(z,x). Often, K() will be a polynomial of the propensity score. The MTE can be computed mathematically as below:

  • $\{\partial E(Y \mid X=x, P(z,x)) / \partial P(z,x)\}$ evaluated at $1 - P(z,x) = U_D$
  • $= \mu_1(x) - \mu_0(x) + \partial K(P(z,x)) / \partial P(z,x)$

The equation above “…is implemented by regressing the outcome Y on all covariates [X], the propensity score, the interaction of the propensity score with all covariates and a polynomial on the propensity score.” This procedure is carried out in the paper empirically by applying these methods to data on breast cancer patients and their choice of breast-conserving surgery with radiation compared to mastectomy.
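
To see why the derivative of E(Y|P) with respect to the propensity score recovers the MTE, here is a small simulation sketch. It is not the authors' procedure: there are no covariates, the propensity score is taken as known, the gain function and all parameter values are assumptions of mine, and I swap the polynomial regression for a simple finite difference of binned means.

```python
import random
from statistics import fmean

random.seed(2)
A, B = 1.0, 2.0  # assumed gain: Delta = A + B * U_D, so MTE(u) = A + B * u

data = []
for _ in range(200_000):
    p = random.random()      # propensity score P(z, x), taken as known
    u_d = random.random()    # U_D, uniform on (0, 1)
    d = u_d > 1.0 - p        # treated when U_D exceeds 1 - P, so P(D=1 | p) = p
    y = random.gauss(0.0, 1.0) + (A + B * u_d if d else 0.0)
    data.append((p, y))

def local_slope(lo, mid, hi):
    """Finite-difference estimate of dE(Y|P)/dP around `mid`."""
    low = [(p, y) for p, y in data if lo <= p < mid]
    high = [(p, y) for p, y in data if mid <= p < hi]
    dy = fmean(y for _, y in high) - fmean(y for _, y in low)
    dp = fmean(p for p, _ in high) - fmean(p for p, _ in low)
    return dy / dp

mte_hat = local_slope(0.1, 0.5, 0.9)
mte_true = A + B * (1.0 - 0.5)   # MTE evaluated at U_D = 1 - P with P = 0.5
print(f"estimated slope = {mte_hat:.2f}, assumed MTE = {mte_true:.2f}")
```

In this setup E(Y|P=p) is quadratic in p, so the symmetric finite difference around 0.5 matches the derivative there. With real data one would estimate the propensity score first and fit the polynomial in P, as the quoted passage describes.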

Can we estimate risk aversion and prudence using a survey question for the general public? This is what a paper by Eisenhauer and Ventura attempts to do.

Methods

In the 1995 Survey of Italian Households’ Income and Wealth, one question asked:

You are offered the opportunity of acquiring a security permitting you, with the same probabilities, either to gain 10 million lire [5165€] or to lose all the capital invested. What is the most you are prepared to pay for this security?

Assuming the respondents answer honestly and precisely (which is a big assumption to make), the authors can characterize an individual’s utility function:

  • $U(w) = 0.5\,U(w-z) + 0.5\,U(w-z+10)$

The variable w represents initial wealth and z is the amount the individual would pay for the security. Using a second-order Taylor expansion of each term around w, we can create an estimate of absolute risk aversion.

  • $2U(w) = [U(w) - zU'(w) + 0.5z^2U''(w)] + [U(w) + (10-z)U'(w) + 0.5(10-z)^2U''(w)]$, or
  • $[(50 - 10z + z^2)/(10 - 2z)]\,U''(w) = -U'(w)$
  • $A(w) = -U''(w)/U'(w) = (10 - 2z)/(50 - 10z + z^2)$
  • $R(w) = A(w)\,w$
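
As a quick check on the formulas above, the sketch below computes the implied risk aversion for a given answer z. The function names and the example wealth level are hypothetical; amounts are in millions of lire, as in the survey question.

```python
def absolute_risk_aversion(z, gain=10.0):
    """A(w) implied by a maximum price z, from the second-order expansion above."""
    return (gain - 2 * z) / (gain ** 2 / 2 - gain * z + z ** 2)

def relative_risk_aversion(z, wealth, gain=10.0):
    """R(w) = A(w) * w."""
    return absolute_risk_aversion(z, gain) * wealth

# Hypothetical respondent: willing to pay 1 million lire, wealth of 50 million.
print(absolute_risk_aversion(1.0))        # 8/41, about 0.195
print(relative_risk_aversion(1.0, 50.0))  # about 9.76
print(absolute_risk_aversion(5.0))        # 0.0: paying 5 implies risk neutrality
```

Under this formula a lower willingness to pay implies higher risk aversion, and a respondent who would pay exactly half of the possible gain is risk neutral.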

The term A(w) represents the Arrow-Pratt measure of absolute risk aversion while R(w) is equal to relative risk aversion. If we differentiate the second equation above with respect to initial income, w, we can calculate a measure of prudence (-U”’/U”).

  • $\eta(w) = A(w) + \{(10 - z)^{-1} + [2z/(100 + z^2)]\}\,\partial z/\partial w$
  • Relative prudence: $w\,\eta(w)$

The term η(w) measures absolute prudence while w·η(w) measures relative prudence.

Results

Since the authors have information regarding each individual’s initial earnings and various sociodemographic factors, they can analyze which types of people are risk averse.

  • Relative risk aversion is between 7.18 and 8.59.
  • Relative prudence is between 7.32 and 8.65.
  • The most risk averse groups are those in poor health and those with only an elementary school education.
  • The least risk averse are the college educated and those with health insurance.
  • Those with risky assets such as stocks or loans are less risk averse.
  • The authors claim that generally R(w) < w·η(w) < R(w)+1 and that risk aversion and prudence are highly correlated.

Healthcare Economist critique

Finding that people are risk averse and prudent is unsurprising, but the levels of risk aversion and prudence are very high compared to other studies. While having a vast array of sociodemographic information is important, simply eliciting a willingness to pay for a risky gamble is unlikely to give a precise estimate of risk aversion. Most people will likely respond to the question categorically (5 million lire, 4.5 million lire, 4 million lire, etc.). Further, finding that people with health insurance are less risk averse is counter-intuitive. One explanation is that having health insurance may be a proxy for wealth. Thus people with health insurance in general could be more risk averse, but since this group of people is also richer (and more affluent people are generally less risk averse) we could have opposing effects.

Today I will review the insightful lecture of Willard Manning at European Science Days. Manning is most famous for his work with the RAND Health Insurance Experiment.

Problems with Healthcare Data

There are 4 major econometric problems one must consider when trying to analyze health care cost and utilization data:

  1. There is a large mass of individuals with zero utilization (or expenditures) during a given time period,
  2. Consumption among those with any care is very skewed (e.g.: visits, hospitalizations, expenditures),
  3. The dependent variable often responds in a non-linear manner to many covariates,
  4. Demand response to covariates may change with the level of demand (e.g., outpatient to inpatient, or low to high levels).

Log or Box-Cox Transformations

While using OLS is easy, it can often produce out-of-range predictions (i.e., $\hat{y} = x\hat{\beta} < 0$). Since health care data are skewed, many researchers log the dependent variable in order to have a more symmetric distribution of errors. The tradeoff of using logs is that although one gains precision and robustness, no one is interested in log-scale results per se.

The Box-Cox transformation of y is as follows:

  • $(y^{\lambda} - 1)/\lambda = x\beta + \varepsilon$, if λ≠0
  • $\log(y) = x\beta + \varepsilon$, if λ=0

One estimates λ using MLE in order to minimize the skewness in the residuals.
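
The transformation itself is a one-liner. The sketch below (the function name is mine) only applies the transform for a given λ; it does not estimate λ by maximum likelihood. It also shows why the log case is the natural limit as λ goes to zero.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

print(box_cox(5.0, 1.0))    # 4.0: lambda = 1 is just a shift of y
print(box_cox(5.0, 1e-6))   # very close to log(5)
print(box_cox(5.0, 0.0))    # log(5)
```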

Log Example

Using a log transformation implies that second moments often matter. For instance, let us assume $\log(y) \mid g \sim N(\mu_g, \sigma_g^2)$, where treatment g=A, B. Then we know

  • $E(y \mid g=A) = \exp[\mu_A + 0.5\sigma_A^2]$
  • $E(y \mid g=A)/E(y \mid g=B) = \exp[(\mu_A - \mu_B) + 0.5(\sigma_A^2 - \sigma_B^2)]$

We can see from the second equation above that the second moment of the distributions matters if there is heteroskedasticity, but not if there is homoskedasticity (i.e., $\sigma_A = \sigma_B = \sigma$).
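
A small numeric illustration, with made-up parameter values, shows how much the variance term can matter. Both comparisons below use the same gap in log-scale means, but in the second one the groups also differ in log-scale variance.

```python
import math

def lognormal_mean(mu, sigma):
    """E(y) when log(y) ~ N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical log-scale means for groups A and B: 1.0 and 0.8.
ratio_homo = lognormal_mean(1.0, 0.5) / lognormal_mean(0.8, 0.5)
ratio_hetero = lognormal_mean(1.0, 1.0) / lognormal_mean(0.8, 0.5)

print(f"same variance:      E(y|A)/E(y|B) = {ratio_homo:.3f}")    # exp(0.2)
print(f"different variance: E(y|A)/E(y|B) = {ratio_hetero:.3f}")  # exp(0.575)
```

With equal variances the ratio depends only on the difference in log-scale means; with unequal variances the same difference in means goes along with a much larger ratio of raw-scale means.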

Marginal Effects with log transformation

Calculating marginal effects with non-linear econometric formulations is often difficult. For instance, we know that $E(y \mid x) = \exp(x\beta)\,E\{\exp(\varepsilon) \mid x\}$. This implies that the marginal effect is equal to:

  • $\partial E(y \mid x)/\partial x_k = \exp(x\beta)\,[\beta_k E\{\exp(\varepsilon) \mid x\} + \partial E\{\exp(\varepsilon) \mid x\}/\partial x_k]$

This is much more complicated than the incorrect formulation $\partial E(y \mid x)/\partial x_k = \exp(x\beta)\beta_k$.

Generalized Linear Model Approach

In this method, one searches for the appropriate β’s to solve the following function:

  • $\sum_i [\partial \mu(x_i\beta)/\partial \beta]\, V(x_i)^{-1}\, (y_i - \mu(x_i\beta)) = 0$

In practice, one usually assumes that $\mu(x\beta) = \exp(x\beta)$. A variance structure is assumed so that $\mathrm{Var}(y \mid x) = \alpha[E(y \mid x)]^{\gamma}$. The values of γ correspond to some standard parametric distributions:

  • Gaussian NLS: γ=0
  • Poisson: γ=1
  • Gamma: γ=2
  • Wald or inverse Gaussian: γ=3.

Two Part Models

To this point, we have been focusing on the skewness problem and been ignoring the fact that many of the observations also clump at zero. We can decompose the expected value as follows:

  • E(y|x) = P(y>0)*E{y|y>0} + P(y=0)*0 = P(y>0)*E{y|y>0}

Now we must estimate P(y>0) and E(y|y>0) separately. The first part we can estimate with a probit model, $P(y>0) = \Phi(x\alpha)$. For the second part, one can log the y term to take skewness into account.

If the log-scale error term is normally distributed, then:

  • $\hat{y} = \Phi(x\hat{\alpha})\exp(x\hat{\beta} + 0.5\hat{\sigma}^2)$, where α, β and σ are estimated from the data.

If the log-scale error term is not normally distributed, then one can use the following formulation:

  • $\hat{y} = \Phi(x\hat{\alpha})\exp(x\hat{\beta})\,D$
  • D is Duan's (JASA 1983) smearing estimator:
  • $D = N^{-1}\sum \exp[\hat{\varepsilon}] = N^{-1}\sum \exp[\ln(y \mid y>0) - x\hat{\beta}_{ols}]$
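
Here is a stripped-down sketch of the two-part calculation with Duan's smearing factor. To keep it short I assume there are no covariates, so the probit collapses to the sample share with positive spending and the log-scale OLS collapses to the mean of ln(y); the simulated data and variable names are mine.

```python
import math
import random
from statistics import fmean

random.seed(3)
# Hypothetical spending data: about 30% zeros, lognormal otherwise.
y = [0.0 if random.random() < 0.3 else math.exp(random.gauss(6.0, 1.2))
     for _ in range(10_000)]

positive = [v for v in y if v > 0]
p_any = len(positive) / len(y)                     # part 1: P(y > 0), no covariates
log_y = [math.log(v) for v in positive]
b_ols = fmean(log_y)                               # part 2: intercept-only OLS on ln(y)
smear = fmean(math.exp(l - b_ols) for l in log_y)  # Duan's smearing factor D

naive = p_any * math.exp(b_ols)                    # ignores the retransformation
y_hat = p_any * math.exp(b_ols) * smear            # two-part prediction with smearing

print(f"sample mean = {fmean(y):.1f}, naive = {naive:.1f}, smeared = {y_hat:.1f}")
```

In this intercept-only case the smeared prediction reproduces the sample mean exactly, while the naive retransformation understates it. With covariates the same steps apply using fitted values from the probit and the log-scale regression.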

Count Data

Count data in health economics is very common. The number of doctor visits, hospitalizations and ER visits all are types of count data. Poisson and Negative Binomial regressions are frequently recommended for these types of data.

The Nursing Home Compare website provides consumers with quality ratings of thousands of nursing homes (NHs) around the country. Are these ratings accurate? Could they be improved?

This is the question which researchers Arling, Lewis, Kane, Mueller and Flood analyze in their 2007 HSR paper. The authors find 2 major flaws with the rankings: 1) there is weak risk-adjustment and thus the ratings do not fully take into account the underlying characteristics of the population being served by the NH, and 2) there are no precision measures included in the rankings.

In order to improve the rankings, the authors use an empirical Bayesian (EB) shrinkage model with risk adjustment.

In the empirical Bayesian model, an empirical distribution serves as the prior. When new data are collected, they enter through the likelihood, and combining the two gives the posterior distribution. Confidence intervals are constructed around the EB estimates from the posterior distribution. In this paper, the authors have data at both the resident and facility level. The prior distribution is estimated using the total nursing home resident population. The likelihood is based on facility-level data, and the posterior distribution is proportional to the product of the prior and the likelihood. The authors explain in more detail:

“The influence of the facility’s observed QM [quality measure] rate on the posterior estimate will depend on the size of the facility and the amount of QM variation within and between facilities. The QM rates in larger facilities will be more certain (e.g., have lower standard errors) than in smaller facilities and, thus, will have greater weight or influence on the overall posterior (EB) estimate. Also, QMs with less variation between facilities have a more certain empirical prior (population average QM rate), which then has a greater influence on the posterior. As the prior tends to pull the posterior estimate toward the population mean, EB estimates are referred to as ’shrinkage’ estimates. “

Using the EB methodology, the standard deviations for most QMs “decreased considerably.” Smaller facilities experienced more shrinkage towards the mean due to their small number of residents. This is logical since one outlier patient would have a much higher impact on average QM rankings in a NH with 10 residents than another facility with 100 residents.
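
The shrinkage logic can be sketched with the textbook normal-normal formula, where the posterior mean is a precision-weighted average of the facility's observed rate and the population mean. This is an illustration under my own assumptions, not the authors' model, and all of the numbers are made up.

```python
def shrink(observed_rate, n_residents, prior_mean, between_var, within_var):
    """Posterior mean of a facility's rate under a normal-normal model."""
    sampling_var = within_var / n_residents            # noise in the observed rate
    weight = between_var / (between_var + sampling_var)  # weight on the facility's own data
    return weight * observed_rate + (1 - weight) * prior_mean

# Two hypothetical facilities with the same observed QM rate (30%)
# but different sizes; the population average rate is 10%.
small = shrink(0.30, n_residents=10, prior_mean=0.10, between_var=0.002, within_var=0.09)
large = shrink(0.30, n_residents=100, prior_mean=0.10, between_var=0.002, within_var=0.09)
print(f"small facility: {small:.3f}, large facility: {large:.3f}")
```

The small facility's estimate is pulled most of the way back toward the population mean, while the large facility's estimate stays much closer to its observed rate.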

The risk adjustment is calculated in three ways: 1) simply excluding the sickest patients (i.e., those with end-stage diseases or who are in a coma), 2) grouping the sample into different risk strata, and 3) using a logistic regression to estimate a risk adjustment factor for each patient. Each of the risk adjustment methods was found to have a strong effect on the rankings.

One problem the authors acknowledge is that using EB and risk adjustment may let some facilities ‘off the hook.’ Small facilities with sicker than average patients may have low QM scores because of an unlucky spate of ill patients or they may truly be poor facilities. Bayesian shrinkage moves their scores closer to the mean, so these facilities’ QM ratings are less responsive to quality improvements or backslides than those of larger facilities.

One of the biggest advances in statistical modeling in the last 30 years has been the use of the bootstrap. For those interested in learning about the bootstrap in more detail, a good place to start is an article by UCSD math professor Dimitris N. Politis which I will summarize here. For more detailed information, one may want to look at An Introduction to the Bootstrap by Efron and Tibshirani.

Set-up

Suppose we have N observations of a random variable X. We can group these as a vector so that $X = (X_1, \ldots, X_N)$, where the $X_i$ are iid with distribution F. If we want to estimate a parameter θ(F) from the data, we can use a statistic T(X) as an approximation. If we assume that F is normal, we can use traditional statistics to compute T(X) as well as the confidence interval around θ(F). If we do not know the distribution F (which a researcher probably does not in reality), then classical statistical theory may be less reliable and a bootstrap methodology may be more robust. The bootstrap allows the researcher to approximate the sampling distribution of T(X) without assuming a form for F, which is especially useful if F is significantly skewed.

The bootstrap procedure creates a new sample by randomly sampling observations from X with replacement until we have a new vector with N observations. We repeat this B times to create our bootstrap data set. Let’s look at an example.

Example

Pretend we have data on how many push-ups I have completed each day over 15 days. I want to estimate the median number of push-ups I do each day. In this sample, N=15 and since we will create ten bootstrap samples, B=10.

Obs. Data B1 B2 B3 B4 B5 B6 B7 B8 B9 B10
1 22 18 25 29 21 21 22 18 31 24 14
2 18 24 14 25 14 21 19 35 25 21 19
3 14 25 24 31 25 21 21 14 21 30 21
4 35 19 18 26 19 25 19 31 24 24 14
5 22 29 29 31 26 30 26 21 22 19 26
6 24 31 24 31 22 30 19 30 31 26 19
7 26 25 19 22 21 25 25 26 22 30 18
8 29 30 22 14 22 22 19 18 31 35 29
9 19 31 21 14 14 21 14 26 18 22 18
10 31 25 24 35 29 22 19 14 31 26 25
11 30 22 25 22 29 14 19 35 19 22 22
12 19 22 24 19 18 35 29 26 21 19 35
13 22 22 22 24 25 24 30 19 35 25 29
14 21 31 25 22 25 14 14 22 31 19 18
15 25 22 19 30 22 35 24 19 19 31 26
Mean 23.8 25.1 22.3 25 22.1 24 21.3 23.6 25.4 24.9 22.2
Median 22 25 24 25 22 22 19 22 24 24 21

The median of the actual data is 22. But we can also estimate the median using a bootstrap methodology. We first randomly choose one of the data points and make it the first data point of B1 (bootstrap sample number 1). We then resample with replacement and make another data point the second observation of B1, and so on until B1 has 15 observations. Because we sample with replacement, data points often repeat. For instance, in B1 the value of observation X10 (31) appears three times. The median varies across the 10 bootstrap samples, but the average value of the median across them is 22.8.

We can also calculate the bootstrap variance (3.36) and standard deviation (1.83). These are calculated according to the formulas:

  • Variance: $B^{-1}\sum_i T(X^*_i)^2 - \left[B^{-1}\sum_i T(X^*_i)\right]^2$
  • S.D. = $(\text{Var})^{1/2}$

Here, $T(X^*_i)$ is the median of bootstrap sample i, where i=1,…,10 since there are 10 bootstrap samples. To calculate the variance, one simply averages the squared medians over the 10 bootstrap samples and then subtracts the square of the average median.
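The arithmetic above can be checked with a short Python sketch. The medians are copied from the table; the helper names, the seed and the 1,000 fresh resamples are my own illustrative choices and not part of the original example.

```python
import random
import statistics

# Push-up counts from the "Data" column of the table above.
data = [22, 18, 14, 35, 22, 24, 26, 29, 19, 31, 30, 19, 22, 21, 25]

# Medians of the ten bootstrap samples B1..B10, copied from the table.
table_medians = [25, 24, 25, 22, 22, 19, 22, 24, 24, 21]


def bootstrap_summary(stats):
    """Mean, variance and standard deviation of bootstrap statistics,
    using the divide-by-B variance formula given in the text."""
    b = len(stats)
    mean = sum(stats) / b
    variance = sum(t * t for t in stats) / b - mean ** 2
    return mean, variance, variance ** 0.5


def bootstrap_medians(sample, n_boot, seed=0):
    """Draw n_boot resamples with replacement and return their medians."""
    rng = random.Random(seed)  # the seed is an arbitrary choice for repeatability
    n = len(sample)
    return [statistics.median(rng.choices(sample, k=n)) for _ in range(n_boot)]


mean, var, sd = bootstrap_summary(table_medians)
print(round(mean, 2), round(var, 2), round(sd, 2))  # 22.8 3.36 1.83

# A fresh set of resamples; these numbers will differ from the table.
fresh = bootstrap_medians(data, n_boot=1000)
print(bootstrap_summary(fresh))
```

With only B=10 resamples the variance estimate is itself noisy, which is why the sketch also draws a larger set of resamples for comparison.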

Let us pretend you have a system of M equations, with N observations for each equation. For example, if we are estimating supply and demand independently over 20 years, M=2 and N=20.

If each of the regressors is predetermined in each equation and we have an exclusion restriction, we can use the Seemingly Unrelated Regressions (SUR) methodology to improve the efficiency of the estimates. SUR is simply computing the generalized least squares (GLS) estimate in the multivariate case. A more detailed explanation is given here.

An example is the following:

  • $\text{PTS} = \alpha_0 + \alpha_1\text{EXP} + \alpha_2\text{MIN} + u$
  • $\text{REB} = \beta_0 + \beta_1\text{EXP} + \beta_2\text{MIN} + \beta_3\text{HT} + v$

Let us assume each basketball player’s points are a function of only their years of experience (EXP), the number of minutes they play per game (MIN) and a constant. The number of rebounds they get per game is also a function of a constant, EXP, and MIN but the person’s height (HT) also affects their rebounding totals. This system of equations would be the same as:

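  • $\text{PTS} = \alpha_0 + \alpha_1\text{EXP} + \alpha_2\text{MIN} + \alpha_3\text{HT} + u$
  • $\text{REB} = \beta_0 + \beta_1\text{EXP} + \beta_2\text{MIN} + \beta_3\text{HT} + v$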

where $\alpha_3$ is constrained to be 0. This constraint is our exclusion restriction. SUR uses a typical instrumental variables approach, but our vector of instruments, z, is equal to the union of the regressors from all equations.

  • $z = \text{union of } (x_1,\dots,x_M)$

In this example, M=2, so $x_1=(1, \text{EXP}, \text{MIN})'$, $x_2=(1, \text{EXP}, \text{MIN}, \text{HT})'$, and $z=(1, \text{EXP}, \text{MIN}, \text{HT})'$. Our orthogonality conditions are that E(zu)=0 and E(zv)=0. Our parameter estimates become:

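$$\delta = \begin{bmatrix} \sigma^{11}A_{11} & \sigma^{12}A_{12} \\ \sigma^{21}A_{21} & \sigma^{22}A_{22} \end{bmatrix}^{-1} \begin{bmatrix} \sigma^{11}c_{11} + \sigma^{12}c_{12} \\ \sigma^{21}c_{21} + \sigma^{22}c_{22} \end{bmatrix}$$

where $A_{mh} = n^{-1}\sum_i x_{im}x_{ih}'$, $c_{mh} = n^{-1}\sum_i x_{im}y_{ih}$, and $\sigma^{mh}$ is the (m,h) element of the inverse of the error covariance matrix.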

OLS can also be used because the regressors are predetermined. In fact, if each equation is just identified, SUR is mathematically equivalent to OLS. If at least one equation is overidentified—which would be the case in the first (PTS) equation in our example—then SUR is more efficient than equation-by-equation OLS.
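To make the estimator concrete, here is a minimal NumPy sketch of a two-step (feasible) version of SUR for the points and rebounds example. It assumes the error covariance matrix is estimated from equation-by-equation OLS residuals, which the text above does not spell out. The player data are simulated and every coefficient value is made up for illustration.

```python
import numpy as np


def sur(x_list, y_list):
    """Feasible SUR for M equations with n observations each.

    x_list[m] is the (n, K_m) regressor matrix of equation m and
    y_list[m] is its (n,) outcome vector. Returns the stacked SUR
    coefficients and the equation-by-equation OLS estimates.
    """
    m_eq = len(y_list)
    n = len(y_list[0])

    # Step 1: equation-by-equation OLS, used only to estimate the
    # covariance matrix of the errors across equations.
    ols = [np.linalg.lstsq(x, y, rcond=None)[0] for x, y in zip(x_list, y_list)]
    resid = np.column_stack([y - x @ b for x, y, b in zip(x_list, y_list, ols)])
    sigma_inv = np.linalg.inv(resid.T @ resid / n)

    # Step 2: build the blocks sigma^{mh} A_mh and sum_h sigma^{mh} c_mh.
    a = np.block([
        [sigma_inv[m, h] * (x_list[m].T @ x_list[h]) / n for h in range(m_eq)]
        for m in range(m_eq)
    ])
    c = np.concatenate([
        sum(sigma_inv[m, h] * (x_list[m].T @ y_list[h]) / n for h in range(m_eq))
        for m in range(m_eq)
    ])
    return np.linalg.solve(a, c), ols


# Simulated (hypothetical) player data; all coefficients are made up.
rng = np.random.default_rng(0)
n = 500
exp_, mins, ht = rng.normal(5, 2, n), rng.normal(25, 5, n), rng.normal(200, 8, n)
errors = rng.multivariate_normal([0, 0], [[4.0, 1.5], [1.5, 1.0]], size=n)
pts = 1 + 0.5 * exp_ + 0.4 * mins + errors[:, 0]
reb = -10 + 0.1 * exp_ + 0.2 * mins + 0.05 * ht + errors[:, 1]

ones = np.ones(n)
x1 = np.column_stack([ones, exp_, mins])        # PTS equation excludes HT
x2 = np.column_stack([ones, exp_, mins, ht])    # REB equation includes HT

delta, ols = sur([x1, x2], [pts, reb])
print("SUR:", delta)
print("OLS:", np.concatenate(ols))
```

Passing the same regressor matrix for both equations, for example `sur([x2, x2], [pts, reb])`, should return the OLS coefficients up to rounding error, which is a handy sanity check on the equivalence mentioned above.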

For more information on Seemingly Unrelated Regressions, see Hayashi (2000) Econometrics, pp. 279-283.

One estimation procedure performed by many novice economists is to use OLS to regress quantity on price. Let us assume the following framework (omitting the i subscripts on the variables):

  • $q^d = \alpha_0 + \alpha_1 p + u$
  • $q^s = \beta_0 + \beta_1 p + v$
  • $q^d = q^s$

If we regress $q^d$ on a constant and p in order to estimate the demand equation for some good, the OLS estimate of $\alpha_1$ is given by the formula $\alpha_1^{OLS} = \mathrm{Cov}(p,q)/\mathrm{Var}(p)$. I solve for Cov(p,q) below:

  • $\mathrm{Cov}(p,q) = \mathrm{Cov}(p, \alpha_0 + \alpha_1 p + u)$
  • $= E(\alpha_0 p + \alpha_1 p^2 + pu) - E(p)\,E[\alpha_0 + \alpha_1 p + u]$
  • $= \alpha_1\mathrm{Var}(p) + \mathrm{Cov}(p,u)$ [1]

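To find Cov(p,u), we can solve the system of equations above for p and then, assuming u and v are uncorrelated, take its covariance with u.

  • $p = [(\alpha_0 - \beta_0) + (u - v)]/(\beta_1 - \alpha_1)$
  • $\mathrm{Cov}(p,u) = \mathrm{Var}(u)/(\beta_1 - \alpha_1)$ [2]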

So, substituting [2] into [1], we have:

  • $\mathrm{Cov}(p,q) = \alpha_1\mathrm{Var}(p) + \mathrm{Var}(u)/(\beta_1 - \alpha_1)$

Thus, our bias term for the OLS regression is:

  • $\mathrm{Cov}(p,q)/\mathrm{Var}(p) - \alpha_1 = \mathrm{Cov}(p,u)/\mathrm{Var}(p)$ [3]

Since we see in equation [2] that Cov(p,u) is not equal to 0 unless Var(u) = 0—which is unlikely—we know the OLS estimate is biased. This phenomenon is known as simultaneous equation bias or endogeneity bias. The problem is that the error term (u) is correlated with the independent variable (p). The main way to solve this problem is to use an instrumental variables methodology.
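A quick simulation makes the size of the bias visible. This is only a sketch: the structural parameters, the unit shock variances and the sample size are hypothetical values I picked, and u and v are drawn independently so that equation [2] applies.

```python
import numpy as np

# Hypothetical structural parameters (not from the text).
alpha0, alpha1 = 10.0, -1.0   # demand: q = alpha0 + alpha1 * p + u
beta0, beta1 = 2.0, 1.0       # supply: q = beta0 + beta1 * p + v

rng = np.random.default_rng(0)
n = 200_000
u = rng.normal(0.0, 1.0, n)   # demand shock
v = rng.normal(0.0, 1.0, n)   # supply shock, drawn independently of u

# Equilibrium price and quantity implied by q_d = q_s.
p = ((alpha0 - beta0) + (u - v)) / (beta1 - alpha1)
q = alpha0 + alpha1 * p + u

cov_pq = np.cov(p, q)[0, 1]
cov_pu = np.cov(p, u)[0, 1]
var_p = np.var(p, ddof=1)

ols_slope = cov_pq / var_p
bias = cov_pu / var_p                                        # equation [3]
bias_theory = np.var(u, ddof=1) / (beta1 - alpha1) / var_p   # using equation [2]

print("OLS slope:", ols_slope)
print("alpha1 + bias:", alpha1 + bias)
print("bias implied by [2]:", bias_theory)
```

With these particular values the bias is close to +1, so the OLS slope lands near 0 even though the true demand slope is -1.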

If you think creating a survey which will compel respondents to answer in an unbiased manner is easy, check out this article originally published in the Wall Street Journal in February (“Census 2010 plays six not-so-easy questions”). The six questions proposed for the 2010 Census short-form questionnaire are as follows:

  1. Name of person
  2. How is this person related to Person 1*? [Person 1 is defined to be the head of household]
  3. What is this person’s sex?
  4. What is this person’s age and what is this person’s date of birth?
  5. Is this person of Hispanic, Latino or Spanish origin?
  6. What is the person’s race?

These seem pretty self-explanatory, right? Well, the questions are not as clear as they seem. Examples of problems with each question are below.

  1. Name: This field can be confusing for migrants. Chinese names are written with the surname first and the given name last (e.g., Yao Ming should be formally addressed as Mr. Yao). Latin-American immigrants typically have two Spanish surnames, one from the father’s family name and one from the mother’s family name.
  2. Relationship: Respondents can choose among 14 possible answers regarding their relationship to the head of household, but a 15th answer–foster child–has been deleted since the 2000 census. How are these poor foster kids going to respond to the 2010 census?
  3. Sex: While this field seems the most self-explanatory, in the 2000 census 0.05% of respondents (or 150,000 of 300 million Americans) checked both the male and female boxes.
  4. Age: According to the WSJ, “Question No. 4 asks age — and for a computer double-check, date of birth — because so many people seem to get it wrong. Adding instructions to ‘report babies as age 0′ when they’re less than a year old, offends some people, census research suggests. But in the 2005 trial it improved the response rate among people who otherwise couldn’t decide how to answer for a six-month old.”
  5. Latino: (see “Race” below)
  6. Race: Again from the WSJ, “But in trial tests, the Census Bureau also found that Asian and Hispanic immigrants could be baffled when asked to lump themselves with other nationality groups. ‘The whole concept of being Latino is a very American construct,’ says Mr. Vargas. ‘People might not know what’s being asked of them.’ Under a 2005 order from Congress, question No. 6 also allows people to call themselves ‘some other race’ and identify that race on a fill-in line. In census tests, respondents declared themselves Creole, Aryan, rainbow and cosmopolitan, among others. Other federal data users, like Social Security and the federal Education Department, don’t recognize those races, though. So in data that the Census Bureau will send to those departments, the bureau will impute a race. ‘Maybe I get it right and maybe I get it wrong. It’s not something I like to do,’ says Mr. Waite.”

To sum up, designing a good survey instrument is harder than you think.

Today we will look at some common distributions used for Bayesian inference.

Beta

The first distribution we will look at is the Beta distribution. The beta density is $[B(a,b)]^{-1}\pi^{a-1}(1-\pi)^{b-1}$. We can show that:

  • If the prior ~ $\pi^{a}(1-\pi)^{b}$
  • And the likelihood ~ $\pi^{S}(1-\pi)^{F}$
  • Then the posterior ~ $\pi^{a+S}(1-\pi)^{b+F}$

where ‘~’ denotes “proportional to” (equal except for a constant). If:

  • S* = a + S + 1
  • F* = b + F + 1

Then

  • n* = S* + F*
  • P* = S*/n*

We can use P* and n* just like we would p and n in a classical binomial framework. P* is our expected mean and the Bayesian confidence intervals can be calculated as follows.

  • Conf. Int.: $P^* \pm t_{\alpha/2}\sqrt{P^*(1-P^*)/n^*}$
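As a rough sketch, the calculation can be wrapped in a small Python function. It takes a and b to be the exponents of the prior as written above, and it uses a default critical value of 1.96 as a stand-in for the critical value in the formula; both the default and the example numbers are my own assumptions.

```python
import math


def beta_posterior_summary(a, b, successes, failures, crit=1.96):
    """Posterior summary for a prior proportional to pi^a (1 - pi)^b.

    crit plays the role of the critical value in the interval formula;
    1.96 is an assumed value for an approximate 95% interval.
    """
    s_star = a + successes + 1
    f_star = b + failures + 1
    n_star = s_star + f_star
    p_star = s_star / n_star
    half_width = crit * math.sqrt(p_star * (1 - p_star) / n_star)
    return p_star, n_star, (p_star - half_width, p_star + half_width)


# Hypothetical numbers: prior exponents a=2, b=6; data with 12 successes, 8 failures.
p_star, n_star, interval = beta_posterior_summary(2, 6, 12, 8)
print(p_star, n_star, interval)
```

In this made-up case the prior leans toward a low π while the data lean high, and the posterior mean P* ends up at 0.5 with n* = 30.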

Normal Distribution

With the normal distribution, we will again have a prior normal distribution and a likelihood function. The question is, which should we rely on more in creating our posterior distribution: our prior assumptions or the data collected?

Let the variance of the prior be $\sigma_0^2$ and the variance of the sample be $\sigma^2$. Also let $\mu_0$ be the prior mean and $X^*$ be the sample mean. If n is the number of observations in the sample, we can calculate the posterior mean as:

  • Posterior mean = $(n_0\mu_0 + nX^*)/(n_0 + n)$
  • where: $n_0 = \sigma^2/\sigma_0^2$

We can see that the prior mean is more important when the prior’s variance is small relative to the sample variance. On the other hand, the sample mean is given more weight when the sample variance is small relative to the prior variance. We can calculate the posterior standard error as:

  • Posterior S.E. = $\sigma/(n_0 + n)^{1/2}$
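A small Python sketch of these formulas follows; the function name and the numbers plugged in are hypothetical and only meant to show how the weights work.

```python
import math


def normal_posterior(mu0, sigma0, xbar, sigma, n):
    """Posterior mean and standard error for a normal prior and normal data."""
    n0 = sigma ** 2 / sigma0 ** 2          # prior expressed as "quasi-observations"
    mean = (n0 * mu0 + n * xbar) / (n0 + n)
    se = sigma / math.sqrt(n0 + n)
    return n0, mean, se


# Hypothetical inputs: prior mean 10 (s.d. 2), sample mean 12 from n=16 draws (s.d. 4).
n0, mean, se = normal_posterior(mu0=10.0, sigma0=2.0, xbar=12.0, sigma=4.0, n=16)
print(n0, mean, round(se, 4))  # 4.0 11.6 0.8944
```

Here the prior counts as much as 4 observations, so the 16 actual observations pull the posterior mean most of the way toward the sample mean.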

Now we will give an example of Bayesian inference in a more complicated setting. This example is based on a problem from pp. 588-591 of Introductory Statistics for Business and Economics by Wonnacott and Wonnacott.

Let us assume that there is a consumer electronics company named Banana, inc. Banana sells iPood mp3 players. Banana, inc., however, has quality control problems and some of the truckloads of iPoods are defective. Based on past shipments, the proportion of defective iPoods per truckload is distributed as follows:

Prior distribution of π
% Defective Nbr. of Shipments % of Shipments
(1) (2) (3)
0% 2 1%
10% 30 15%
20% 40 20%
30% 42 21%
40% 34 17%
50% 26 13%
60% 16 8%
70% 8 4%
80% 2 1%
90% 0 0%
100% 0 0%
Total 200 100%

CircuitVillage receives a truckload of iPoods from Banana, inc. They decide to take a random sample of n=5 iPoods out of the truckload in order to get sample evidence on π (the proportion defective in this truckload). CircuitVillage finds that 3 of the 5 iPoods are defective. What is the posterior distribution of π?

We can calculate the likelihood function using the binomial distribution. The binomial probability function is as follows:

$$P(k \text{ out of } n) = \frac{n!}{k!(n-k)!}\,p^k (1-p)^{n-k}$$

We know that k=3 and n=5, and thus we can evaluate the likelihood at each candidate value p=π.

Calc. to obtain posterior dist.
% Defective Likelihood of π Prior x Likelihood Posterior
(1) (4) (5) (6)
0% 0.000 0.000 0.000
10% 0.008 0.001 0.008
20% 0.051 0.010 0.064
30% 0.132 0.028 0.172
40% 0.230 0.039 0.243
50% 0.313 0.041 0.252
60% 0.346 0.028 0.172
70% 0.309 0.012 0.077
80% 0.205 0.002 0.013
90% 0.073 0.000 0.000
100% 0.000 0.000 0.000
Total (columns 5 and 6) 0.161 1.000

Column 4 is found by simply plugging the first column value in for p into the binomial probability function where k=3 and n=5. Column 5 is found by multiplying column (3) by column (4). To normalize the distribution so that the probabilities sum to 1, we must divide by the sum of column five (0.161) and thus we have the posterior distribution.

We can ask what the probability is that less than 25% of the iPoods in the shipment are defective. According to our prior, we would believe that 36% (.01 + .15 + .20) of the truckloads contain iPoods where less than 25% of them are defective. After observing that 3 of the 5 iPoods sampled are defective, our posterior distribution says that it is less likely that the iPood shipment has a low defect rate. In fact, according to our posterior there is only a 7% chance (0.000 + 0.008 + 0.064) that less than 25% of the iPoods in this shipment from Banana, inc. are defective.
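The whole table can be reproduced in a few lines of Python. This is a sketch using only the shipment counts and the sample result given above; the variable names are mine.

```python
from math import comb

pi_values = [i / 10 for i in range(11)]               # 0%, 10%, ..., 100% defective
shipments = [2, 30, 40, 42, 34, 26, 16, 8, 2, 0, 0]   # column (2)
prior = [s / sum(shipments) for s in shipments]       # column (3)

k, n = 3, 5  # 3 defective iPoods in a sample of 5


def binomial_pmf(k, n, p):
    """Probability of k defectives in n draws when the defect rate is p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


likelihood = [binomial_pmf(k, n, p) for p in pi_values]   # column (4)
joint = [pr * li for pr, li in zip(prior, likelihood)]    # column (5)
total = sum(joint)
posterior = [j / total for j in joint]                    # column (6)

for p, li, j, post in zip(pi_values, likelihood, joint, posterior):
    print(f"{p:4.0%} {li:.3f} {j:.3f} {post:.3f}")

print("sum of column (5):", round(total, 3))
print("prior P(pi < 25%):", round(sum(prior[:3]), 2))
print("posterior P(pi < 25%):", round(sum(posterior[:3]), 3))
```

Because the code works with unrounded values, the posterior probability of a defect rate under 25% comes out slightly below the sum of the rounded table entries, but it still rounds to 7%.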

Bayesian Inference is an important econometric tool. Over the next few days, we will review some of the basic Bayesian inference methods.

Economicitis occurs in 300 out of every 100,000 adults. Recently, however, a test has been developed to screen for the disease. Of 1000 individuals with economicitis who were tested, only 40 had an erroneous negative test. Of 1000 healthy individuals, 20 had an erroneous positive test result.

My friend Ron received the sad news that his test result shows that he has economicitis. Ron wants to know: given that the test result is positive, what is the actual chance that he has the disease?

One way to estimate this is using Bayesian inference. According to Bayesian theory:

  • Posterior Odds = prior odds x likelihood ratio
  • Posterior Odds = $\frac{p(\theta_1)}{p(\theta_2)} \times \frac{p(X_1 \mid \theta_1)}{p(X_1 \mid \theta_2)}$

The prior odds of having the disease are 300/99,700. This is equivalent to the prior probability Ron has the disease (300/100,000) divided by the prior probability he does not have the disease (99,700/100,000). The likelihood ratio is equal to the probability of having a positive test given the person has the disease (1-40/1000) divided by the probability of having a positive test given that the person is healthy (20/1000). Thus we have:

  • Posterior Odds = (300/99,700) x [(960/1000)/(20/1000)] = .144

This means that the odds that an individual who tests positive for economicitis actually has the disease are about 1 to 7. To calculate the posterior probability, simply use the following formula:

  • Posterior Probability = (posterior odds)/(1 + posterior odds)=.144/1.144=12.6%

Thus, Ron should not be too worried about having the disease. Using the prior, Ron only had a 0.3% chance of having the disease, but even after having tested positive for economicitis, Ron still only has a 12.6% chance of being stricken by this deadly disease.
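The same numbers can be run through a few lines of Python. The sketch below follows the odds formulation used above and cross-checks it against Bayes' rule written directly in probabilities; the variable names are my own.

```python
prevalence = 300 / 100_000
sensitivity = 1 - 40 / 1000          # P(positive | disease)
false_positive_rate = 20 / 1000      # P(positive | healthy)

prior_odds = prevalence / (1 - prevalence)
likelihood_ratio = sensitivity / false_positive_rate
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

# Cross-check with Bayes' rule written directly in probabilities.
direct = (prevalence * sensitivity) / (
    prevalence * sensitivity + (1 - prevalence) * false_positive_rate
)

print(round(posterior_odds, 3), round(posterior_prob, 3), round(direct, 3))  # 0.144 0.126 0.126
```

The result is driven by the low prevalence: healthy people are so much more numerous that their 2% false positives outnumber the true positives.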

Today I will review a few basic concepts of time series econometrics. A time series is a stochastic process whose observations are indexed by time period. For instance, $\{z_i\}$ (i=1,2,3,…) is a stochastic process with $z_i$ representing GDP in each quarter. Below are a few definitions that are important to econometric estimation using time series data.

  • Covariance Stationary Processes. A process is covariance stationary if i) $E(z_i)$ does not depend on i, and ii) $\mathrm{Cov}(z_i, z_{i-j})$ exists, is finite, and depends only on j but not on i.
  • White Noise. A covariance stationary process, $\{z_i\}$, is white noise if $E(z_i)=0$ and $\mathrm{Cov}(z_i, z_{i-j})=0$ for $j \neq 0$. We typically assume that the error term in most estimating equations is white noise.
  • Ergodicity. A stationary process is said to be ergodic if, for any two bounded functions f and g:
    • $\lim_{n\to\infty} |E[f(z_i,\dots,z_{i+k})\,g(z_{i+n},\dots,z_{i+n+l})]| = |E[f(z_i,\dots,z_{i+k})]|\,|E[g(z_{i+n},\dots,z_{i+n+l})]|$
    • This means that as observations from the time series become further and further apart, they become nearly independent.
  • Martingale. Let $x_i$ be an element of $z_i$. The scalar process $\{x_i\}$ is a martingale with respect to $\{z_i\}$ if:
    • $E(x_i \mid z_{i-1}, z_{i-2}, z_{i-3}, \dots, z_1) = x_{i-1}$.
    • This means that given all the information from the past, our best guess at the value of x in this time period is the value of x last time period. For instance, Hall’s Martingale Hypothesis states that given a variety of macroeconomic variables, my best guess of aggregate consumption this quarter is equal to aggregate consumption last quarter.
  • Random Walk. A random walk is a specific type of martingale formed by the cumulative sum of a white noise process. Let $\{g_i\}$ be a white noise process. Then a random walk process $\{z_i\}$ would equal the following:
    • $z_1 = g_1$
    • $z_2 = g_1 + g_2$
    • $z_i = g_1 + g_2 + \dots + g_i$
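As an illustration, a random walk can be built from white noise with a running sum. The sketch below uses independent standard normal draws as the white noise, which is one convenient assumption among many; the seed and series length are arbitrary.

```python
import random
from itertools import accumulate

rng = random.Random(0)                           # arbitrary seed for repeatability
g = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # assumed white noise: iid N(0, 1)
z = list(accumulate(g))                          # z_i = g_1 + g_2 + ... + g_i

# First differences of the walk recover the white noise increments.
increments = [z[0]] + [z[i] - z[i - 1] for i in range(1, len(z))]

print(z[:3])
print(sum(increments) / len(increments))         # sample mean of the increments
```

Because each increment has mean zero, the best guess for the next value of z is its current value, which is the martingale property described above.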

For further information, see: Hayashi, Fumio (2000) Econometrics. Princeton University Press. USA.

One of the basic concepts in statistics is the use of mathematically rigorous tests to determine whether or not a researcher can reject their null hypothesis. The null hypothesis is the state of the world the researcher assumes exists. The alternative hypothesis is, as the name suggests, an alternative to the null hypothesis. Through these statistical tests, researchers try to find the truth regarding a certain phenomenon. The degree of certainty the investigator has in his or her conclusion depends on the probability of type I and type II error in the test. Type I error occurs when the null is incorrectly rejected; type II error occurs when we fail to reject the null when in fact the alternative is true. Below are more concrete examples of type I and type II errors.

Criminal Justice

In the criminal justice system, defendants are assumed to be innocent until proven guilty. Thus, the null hypothesis is that the individual is innocent while the alternative is that the defendant actually committed the crime. A type I error would occur if the individual was convicted of a crime they didn’t commit. A type II error would occur when a guilty person is set free.

Clinical Drug Trial

In the case of a clinical trial for a new pharmaceutical, the null hypothesis would be that a new drug (drug N) is no better than the current drug (drug O). On the other hand, the alternative hypothesis would state that drug N is superior to drug O. A type I error would conclude that the new drug is better than drug O when in fact it is not. A type II error would conclude that the new and old pharmaceuticals are equivalent when in fact drug N is superior.
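To see both error rates in action, here is a small Monte Carlo sketch of the drug trial. Everything numerical in it is an assumption made for illustration: a one-sided z-test with known standard deviation, a 5% significance level, 25 patients per trial, and a true advantage of 0.5 for drug N under the alternative.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
alpha = 0.05                       # assumed significance level
n, sigma = 25, 1.0                 # assumed sample size and known s.d.
effect = 0.5                       # assumed true advantage of drug N under the alternative
n_sims = 20_000
z_crit = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value


def rejection_rate(true_mean):
    """Share of simulated trials in which the one-sided z-test rejects the null."""
    samples = rng.normal(true_mean, sigma, size=(n_sims, n))
    z_stats = samples.mean(axis=1) / (sigma / np.sqrt(n))
    return float(np.mean(z_stats > z_crit))


type_1_rate = rejection_rate(0.0)          # null true, but rejected
type_2_rate = 1 - rejection_rate(effect)   # alternative true, but not rejected
print(type_1_rate, type_2_rate)
```

The type I rate should sit near the chosen 5%, while the type II rate depends on the effect size and sample size; raising n in the sketch shrinks it.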

Ordinary Least Squares 

If you have studied basic statistics, it’s likely that you have come across the ordinary least squares (OLS) estimation technique.  OLS minimizes the sum of squared distances between the dependent variable (‘y’) and a linear prediction of y (y_hat).  The parameter vector ‘β_ols’ minimizes this distance.  The most important assumption in order for β_ols to reflect the true parameters in the population is for the regressors to be uncorrelated with the error terms (cov(x,e)=0).  Sometimes this is not the case.  The assumption fails if:

  1. There are omitted variables which are correlated with the regressors (x).
  2. We have a system of simultaneous equations.
  3. There is an errors-in-variables problem.
  4. The system has a lagged dependent variable with a serially correlated disturbance.
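The first case in the list can be illustrated with a short NumPy sketch. The data-generating process below is hypothetical: y depends on x and on a second variable w that is correlated with x, and the regression is run with and without w.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
w = 0.8 * x + rng.normal(size=n)          # a second variable correlated with x
e = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * w + e           # hypothetical true model


def ols(regressors, outcome):
    """OLS coefficients with an intercept, via least squares."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta


beta_full = ols([x, w], y)    # both regressors included
beta_short = ols([x], y)      # w omitted, so it ends up in the error term

print("full model:", beta_full)
print("w omitted: ", beta_short)
```

When w is left out, the coefficient on x absorbs part of w's effect and drifts from 2 toward 2 + 1.5 × 0.8 = 3.2, because the error term is now correlated with x.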