Econometrics

You are currently browsing the archive for the Econometrics category.

Price indices are useful for calculating inflation over time.  The consumer price index (CPI) measures changes in prices for the overall economy.  Researchers can also use price indices to understand the evolution of the price of health care over time.  For instance, the Bureau of Labor Statistics also calculates a CPI for Medical Care and Medical Care Services.

The question of how to calculate a price index is far from trivial, however.  To calculate the change in the price of a single good i between years 1 and T, one could simply use the following formula:

  • $P_{\text{simple}} = p_{i,T} / p_{i,1}$

However, a price index indicates the change in prices for a basket of goods.  If you are considering the change in price of 10 medical services, how much weight do you give to each one?

Economists have generally settled on the following solution: goods that make up a large share of total expenditures should be weighted more heavily than those that make up a small share.  For instance, let us imagine a simple example where you have two expenses: food and medical care.  The price of food goes up by 10% and the price of medical care goes up by 20%.  Let us assume that food makes up a larger share of your budget than medical expenses and that the initial value of the price index is 1.0 (i.e., the index equals 1.0 in period 1).  Thus, if 80% of your income goes to food and 20% of your income goes to medical expenses, then the value of the price index one year from now would be 80%*1.1+20%*1.2=1.12.

Sounds easy right?  Not so fast.

I said that 80% of the person’s budget was made up by food, but does that figure refer to your budget expenditures in the first time period or the second time period?  Let us assume the following:

  • Pfood,1=$1; Qfood,1=800; Efood,1=$800;
  • Pfood,2=$1.1; Qfood,2=800; Efood,2=$880;
  • Pmed,1=$100; Qmed,1=2; Emed,1=$200;
  • Pmed,2=$120; Qmed,2=3; Emed,2=$360;

Above, P, Q and E refer to price, quantity and expenditures respectively; the first subscript in the formulas above refers to the good (food or medicine) and the second subscript refers to the time period (1 or 2).  In the example, 80% of the person’s budget in period 1 is for food and 20% is for medical care.  If we use the budget shares in the first period to weight the price changes, then we could calculate the price index as:

  • (800*$1.1+2*$120)/(800*$1+2*$100)=1.120

This method is known as the Laspeyres price index.  The general formula is: $\sum_i p_{it} q_{i0} / \sum_i p_{i0} q_{i0}$.

An alternative measure is the Paasche  price index.  In this case, we weight the price changes depending on the bundle of goods in the last time period under consideration.  In the example, our price index would be:

  • (800*$1.1+3*$120)/(800*$1+3*$100)=1.127

The price index is higher now.  Why?  In the last period, the quantity of medical care purchased increased (from 2 to 3) while the quantity of food purchased stayed the same at 800.  This means that the Paasche price index will put relatively more weight on the price changes for medicine.  Since the price of medicine increased faster than the price of food, the overall price index will be higher in this example than in the case of the Laspeyres price index.  The general formula for the Paasche price index is: $\sum_i p_{it} q_{it} / \sum_i p_{i0} q_{it}$.

However, neither the Laspeyres nor the Paasche index takes into account substitution effects between goods. Goods are weighted statically based on the quantity purchased in either the first period (Laspeyres) or last period (Paasche). To address this problem, one can use the Fisher price index. This index does account for individuals substituting across different types of goods. To calculate the Fisher index, one simply takes the geometric mean of the Laspeyres and Paasche indices. According to the example above, this means the price index would be:

  • $P_F = (P_P \cdot P_L)^{0.5} = (1.120 \times 1.127)^{0.5} = 1.123$
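
As a quick check on the arithmetic, the three indices can be sketched in a few lines of Python. The prices and quantities are the ones from the food and medical care example; the function and variable names are illustrative choices, not part of any standard library.

```python
# Prices and quantities from the food / medical care example above.
p1 = {"food": 1.00, "med": 100.0}   # period-1 prices
p2 = {"food": 1.10, "med": 120.0}   # period-2 prices
q1 = {"food": 800, "med": 2}        # period-1 quantities
q2 = {"food": 800, "med": 3}        # period-2 quantities


def basket_cost(prices, quantities):
    """Cost of buying the given quantities at the given prices."""
    return sum(prices[good] * quantities[good] for good in prices)


def laspeyres(p_base, p_new, q_base):
    """Price change of the base-period basket."""
    return basket_cost(p_new, q_base) / basket_cost(p_base, q_base)


def paasche(p_base, p_new, q_new):
    """Price change of the final-period basket."""
    return basket_cost(p_new, q_new) / basket_cost(p_base, q_new)


def fisher(p_base, p_new, q_base, q_new):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return (laspeyres(p_base, p_new, q_base) * paasche(p_base, p_new, q_new)) ** 0.5


L = laspeyres(p1, p2, q1)
P = paasche(p1, p2, q2)
F = fisher(p1, p2, q1, q2)
print(round(L, 4), round(P, 4), round(F, 4))
```

Because the code carries the unrounded Paasche value (about 1.1273) into the geometric mean, the Fisher index comes out at roughly 1.1236, slightly above the 1.123 obtained from the rounded inputs.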

One can also chain the Fisher index calculations from each year in order to produce a chain-weighted Fisher price index, but I’ll save that explanation for another day.

Are foreign-born individuals more likely to be literate (in English) than native born Americans?  One would think not, but consider the following information:

Robinson (1950) computed the following two pieces of information for each state: the percent of the population who are foreign-born, and the percent who are literate.  Robinson observes that states with a high percentage of foreign-born individuals have higher literacy rates.  There is a 0.56 correlation between a state’s proportion of foreign-born individuals and a state’s proportion of individuals who are literate. Does this mean that being foreign-born causes an increase in literacy?

Actually no. This correlation is an ‘ecological’ correlation,  because the unit of analysis is not an individual person but a group of people—the residents of a state. In reality, the association is negative: the correlation computed at the individual level is -0.11.

These figures can be explained as follows.  Let us assume that high income states have high levels of literacy.  Also assume that foreign-born individuals have low levels of literacy.  Because immigrants are drawn to states with high income and the potential for economic growth, one could see a positive correlation between literacy and foreign-born individuals on a state level.  High income states have high literacy among natives, but many foreign-born individuals.  Low income states have low literacy levels among natives, but also few immigrants.  Thus, one could see a positive correlation between proportion of foreign-born individuals and literacy on an aggregate level, but a negative correlation on an individual level.
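
The sign reversal can be reproduced with a small made-up data set. The sketch below uses three hypothetical states (the counts are invented for illustration and are not Robinson's data) in which richer states have both more immigrants and more literate natives, while within every state the foreign-born are less literate than natives.

```python
import numpy as np

# Hypothetical counts per state (not Robinson's data):
# (natives, literate natives, foreign-born, literate foreign-born)
states = {
    "low_income":  (90, 72, 10, 6),
    "mid_income":  (80, 72, 20, 14),
    "high_income": (70, 70, 30, 24),
}

foreign_share, literacy_rate = [], []   # one entry per state
foreign_flag, literate_flag = [], []    # one entry per person

for natives, lit_nat, foreign, lit_for in states.values():
    total = natives + foreign
    foreign_share.append(foreign / total)
    literacy_rate.append((lit_nat + lit_for) / total)
    # Expand the counts into individual 0/1 records, natives first.
    foreign_flag += [0] * natives + [1] * foreign
    literate_flag += ([1] * lit_nat + [0] * (natives - lit_nat)
                      + [1] * lit_for + [0] * (foreign - lit_for))

state_corr = np.corrcoef(foreign_share, literacy_rate)[0, 1]
indiv_corr = np.corrcoef(foreign_flag, literate_flag)[0, 1]
print(f"state-level correlation:      {state_corr:+.2f}")
print(f"individual-level correlation: {indiv_corr:+.2f}")
```

With these numbers the state-level correlation is strongly positive while the individual-level correlation is negative, which is the same qualitative pattern as in Robinson's figures.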

This is the problem of ecological inference.  As David Freedman explains, ecological inference occurs when inferences about individual behavior are drawn from data about aggregates.  Stereotypes are another example of ecological inference.  In this case, one assumes that individual members of a group have the average characteristics of the group at large.

How does one model a demand system? In general, researchers only observe the equilibrium prices and quantities of goods over time. Changing prices or quantities could be due to shifts in either the demand or supply curve. Thus, modeling demand systems is difficult.

Deaton and Muellbauer (1980) propose one method: the Almost Ideal Demand System (AIDS). Today I will review this demand estimation strategy.

Origin

The AIDS model originates from the PIGLOG class of models. PIGLOG models allow researchers to treat aggregate consumer behavior as if it were the outcome of a single maximizing consumer. One must assume that in equilibrium, the marginal propensity to consume is the same across households.

One could add sophistication to the model by including parameters ($k_h$ in the paper) that measure household size, age composition, and other household characteristics. In general, Deaton and Muellbauer assume that all the $k_h$ parameters are equal to one, which implies that all households have similar preferences.

Estimation

The AIDS estimation strategy is attractive because it is simple to estimate and, under certain assumptions, avoids the need for non-linear estimation. Further, one can test the homogeneity and symmetry restrictions by imposing restrictions on the parameter values in the estimation.

Deaton and Muellbauer (1980) describe how one can begin with primitive, individual utility functions and aggregate them to form the following estimation framework:

  • $w_i = (\alpha_i - \beta_i \alpha_0) + \sum_j \gamma_{ij} \log p_j + \beta_i \{\log x - \sum_k \alpha_k \log p_k - \tfrac{1}{2} \sum_k \sum_j \gamma_{kj} \log p_k \log p_j\}$

In this equation, $w_i$ represents the budget share of good i, and the indices j and k run over the goods. Prices are represented by p and total expenditures by x. One can estimate all parameters using a maximum likelihood methodology. In general, the following three restrictions are imposed to simplify estimation:

  1. Adding up restriction: $\sum_i \alpha_i = 1$; $\sum_i \gamma_{ij} = 0$; $\sum_i \beta_i = 0$;
  2. Homogeneity restriction: $\sum_j \gamma_{ij} = 0$;
  3. Slutsky symmetry restriction: $\gamma_{ij} = \gamma_{ji}$

One can simplify this estimation in situations where prices are closely collinear. Instead of using an exact price index, P, one could calculate an approximate price index P*. One candidate recommended by Deaton and Muellbauer is Stone’s (1953) index: $\log P^* = \sum_k w_k \log p_k$. If P* is a good approximation, in the sense that P ≈ φP* for some constant φ, then one can use the following equation as an approximation of the full estimation above:

  • $w_i = (\alpha_i - \beta_i \log \phi) + \sum_j \gamma_{ij} \log p_j + \beta_i \log(x/P^*)$
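
To make the approximation concrete, here is a minimal sketch of Stone's index and of the real-expenditure regressor log(x/P*). The budget shares, relative prices and total expenditure below are hypothetical numbers chosen for illustration, and the function name is an assumption rather than anything from the paper.

```python
import math


def stone_log_price_index(budget_shares, prices):
    """Stone's index: log P* = sum_k w_k * log p_k."""
    return sum(w * math.log(p) for w, p in zip(budget_shares, prices))


# Hypothetical inputs: two goods with budget shares of 80% and 20%,
# and prices expressed relative to a base period in which p = 1.
shares = [0.8, 0.2]
prices = [1.1, 1.2]
total_expenditure = 1240.0

log_p_star = stone_log_price_index(shares, prices)
# The regressor that multiplies beta_i in the linear approximation.
log_real_expenditure = math.log(total_expenditure) - log_p_star

print(round(math.exp(log_p_star), 4), round(log_real_expenditure, 4))
```

Since log P* is just a share-weighted sum of log prices, it can be computed from the data before estimation, which is what makes the approximate system linear in the parameters.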

Testing Restrictions

In order to test the homogeneity restriction, one can leave out a single $p_j$ term and instead use relative prices ($p_j/p_n$):

  • $w_i = \alpha_i^* + \sum_{j=1}^{n-1} \gamma_{ij} \log(p_j/p_n) + \beta_i \log(x/P^*)$

F-ratios are calculated for each of the i equations to determine whether the homogeneity restriction holds. The paper also gives the steps needed to test the symmetry restriction. In order to conduct the symmetry test, one must calculate the price index as follows:

  • $\log P = \alpha_0 + \sum_k \alpha_k \log p_k + \tfrac{1}{2} \sum_k \sum_j \gamma_{kj} \log p_k \log p_j$

In order to calculate this “correct” price index, one must choose an appropriate value for α0.

The Black Swan by Nassim Nicholas Taleb is an interesting book about probability outside of the traditional Gaussian framework and about how paradigm changes often arise.  The highlight of the book is its philosophy of the black swan and the unknown unknown.  The book also includes discussion of behavioral economics and tries to discredit Gaussian statistics.  The book is interesting but rambles somewhat.  Further, Taleb writes in a condescending manner, disparaging other intellectuals and experts.  Taleb does make some good points, but the negative tone becomes tiresome.

The Turkey Problem

The crux of the book can be understood by looking at the following series.  This series represents the weight of the turkey over 30 days.

Assume you are a turkey. What would you predict would happen to your weight over the next 15 days? Using ordinary least squares, one would predict that the turkey would continue to grow at 1/4 pound per day. Let us see what happened in reality.

We see that a “black swan” event has occurred, one that was outside the paradigm one would establish based on past data. We see that on day number 41, the turkey is slaughtered. This is a huge paradigm shift from the point of view of the turkey. One can see that relying on past data to predict the future will be highly inaccurate in the presence of these black swans.

Other Non-linearities

Let us look at another seemingly linear series.

 

How would you predict the series would continue into the future?  Using linear extrapolation techniques, one would predict the series would increase linearly ad infinitum. However, let us examine the true data generating process.

We can see that the data come from a sine function.
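
The same point can be sketched numerically. The sample range and the forecast horizon below are assumptions chosen for illustration (the original series is not reproduced here): a short stretch of a sine curve near zero looks almost perfectly linear, so a fitted line extrapolates badly.

```python
import numpy as np

# Assumed observation window: 30 points on [0, 0.5], where sin(x) is nearly linear.
x_obs = np.linspace(0.0, 0.5, 30)
y_obs = np.sin(x_obs)

# Ordinary least squares fit of a straight line to the observed window.
slope, intercept = np.polyfit(x_obs, y_obs, 1)

# Extrapolate far outside the window and compare with the true process.
x_future = np.pi
linear_forecast = slope * x_future + intercept
actual = np.sin(x_future)

print(f"linear forecast at x = pi: {linear_forecast:.2f}")
print(f"actual value at x = pi:    {actual:.2f}")
```

The in-sample fit is excellent, yet the forecast at x = π is around 3 while the true value is essentially 0.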

The key insight of Taleb’s book is that these non-linearities, paradigm shifts and black swans occur all the time. Further, they are responsible for most of the innovations and important events in history. Thus, ignoring black swans can be perilous. Taleb’s message is one of humility.  It is exceedingly difficult to predict the future.  A sure thing is rarely ever such.  Thus, we should view expert opinion with some skepticism and embrace, rather than reject, uncertainty.

The Economist has an interesting article about the failure of macroeconomics to predict the latest downturn and what it means for the future of the profession of economics.  As one who has little faith in macroeconomics, I certainly can commiserate with the opinions of the following Ph.D. students:

According to David Colander, who has twice surveyed the opinions of economists in the best American PhD programmes, macroeconomics is often the least popular class. “What did you learn in macro?” Mr Colander asked a group of Chicago students. “Did you do the dynamic stochastic general equilibrium model?” “We learned a lot of junk like that,” one replied.

Many people model trends in the stock market using Autoregressive (AR), Moving Average (MA), or Autoregressive Moving Average (ARMA) models. In these models, shocks to the stock price in prior periods help to determine the price in the current period. Shocks in the more distant past are generally assumed to have less influence on the current stock price than shocks in the more recent past. However, in simple models, the standard deviation of the shocks is assumed to be the same in each period (e.g., $\sigma^2_{\varepsilon,t} = \sigma^2_{\varepsilon}\ \forall\, t$).

GARCH models estimate volatility of these shocks in each period as a function of both the shocks and their standard deviations in prior periods. Generally, more volatility in the recent past will result in more volatility in the present.
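
As a rough illustration of that recursion, the sketch below implements the conditional variance of a GARCH(1,1) process, h_t = a0 + a1·u²_{t-1} + b1·h_{t-1}. The parameter values and the shock series are made up for the example, and the function name is an assumption.

```python
def garch_variance(shocks, a0=0.1, a1=0.2, b1=0.7):
    """Conditional variances h_t of a GARCH(1,1) process for a given shock series.

    h_t = a0 + a1 * u_{t-1}**2 + b1 * h_{t-1}, started at the
    unconditional variance a0 / (1 - a1 - b1).
    """
    h = [a0 / (1.0 - a1 - b1)]
    for u_prev in shocks[:-1]:
        h.append(a0 + a1 * u_prev ** 2 + b1 * h[-1])
    return h


# Hypothetical shocks: small moves with one extreme event in the third period.
shocks = [0.5, -0.3, 4.0, 0.2, -0.1]
h = garch_variance(shocks)
for t, (u_t, h_t) in enumerate(zip(shocks, h)):
    print(t, u_t, round(h_t, 4))
```

The large shock in period 2 raises the conditional variance in period 3, and because b1 is large the variance is still elevated in period 4: more volatility in the recent past produces more volatility in the present.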

A paper by Bauwens and Storti (2009) examines an interesting phenomenon: “the persistence of the volatility process tend to decrease after extreme events such as those observed in October 1987 and September 2001.” Standard GARCH models equally weight all shocks regardless of whether or not they occur during extreme events. In order to incorporate the observation that volatility decreases after extreme events, the authors propose a Weighted GARCH (WGARCH) model. In their model, shocks are modelled as follows:

  • $u_t = z_t [s_{t-d} h_{1,t} + (1 - s_{t-d}) h_{2,t}]^{1/2}$
  • $h_{k,t} = a_{0k} + \sum_{i=1}^{p} a_{ik} u_{t-i}^2 + \sum_{j=1}^{q} b_{jk} h_{k,t-j}$ for $k = 1, 2$

The term s is in essence a weight which determines the volatility. If h1,t represents the high volatility period and h2,t represents the tranquil period, the term s balances the regime in which we are currently situated. For instance, when s approaches 0 we are in the tranquil regime and when s approaches 1, we are in the volatile regime. Model parameters can be estimated using MLE or Bayesian approaches.

This WGARCH model seems to be useful when the persistence of volatility differs between two types of regimes. It will be interesting to see whether the model produces a good fit when applied to stock returns after the most recent financial meltdown.

When beginning your research, here are the questions you need to ask yourself [from Mostly Harmless Econometrics]:

  1. What is the causal relationship of interest?  What specific mechanism will cause a change in the dependent variable of interest?  Often one uses economic theory to predict these causal relationships.
  2. What experiment could be used to capture the causal effect of interest?  Before you can decide on an identification technique, you must figure out what the ideal experiment would be.  If you want to estimate the effect of physician payment on surgery rates, would you randomize patients to different physicians?  Different physicians may select into different payment schemes.  Would you randomize physician payment?  In this case, different types of patients may select different doctors.  What would be the ideal?
  3. What is your identification strategy?  Many medical studies use randomized controlled trials, but there are very few RCTs investigating economic phenomena.  A researcher must decide how to eliminate problems of selection and endogeneity.  Common strategies include OLS, difference-in-difference, instrumental variables, and others.
  4. What is your mode of statistical inference?  This is the nitty-gritty stuff.  How will you estimate your standard errors?  What variables do you include in your regression?   Is the sample representative?  What is the correct group to study?  In my paper on Marriage and Weight Gain, I limit the sample to individuals aged 18-55 since these are the individuals most likely to be in the dating market.

If we start with 1000 people and 10% of the population dies each year, how many people will be left in 10 years?  One could figure this out manually.  However, for more complicated models involving covariate predictors of survival, survival analysis is helpful.

Survival analysis starts with a hazard function, λ(t), which gives the probability of failure each year for all survivors. In our simple example, λ(t)=.10  ∀ t.
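
For the simple constant-hazard case the manual calculation is short enough to sketch directly. The function name is illustrative; the numbers are the ones from the example.

```python
def expected_survivors(initial, hazard, years):
    """Expected survivors when a constant share `hazard` of survivors fails each year."""
    return initial * (1.0 - hazard) ** years


# 1000 people, 10% of the remaining population dies each year, 10 years.
remaining = expected_survivors(1000, 0.10, 10)
print(round(remaining, 1))  # about 348.7
```

Note that the answer is not zero even though ten years of "10% per year" might suggest it: the 10% applies to the shrinking pool of survivors, not to the original 1000.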

Formally, in continuous time the hazard function is defined as:

  • $\lambda(t) = \lim_{h \to 0} P(t \le T < t+h \mid T \ge t)/h$

The variable T is the number of periods the person survives. We also have an associated cdf, F(t), which is equal to the cumulative probability of failure for T≤t. Thus we can calculate a survival function, S(t)=1-F(t), which gives the probability a person will survive to some period after period t.  The pdf is equal to the derivative of the cdf, or also f(t)=S(t)*λ(t).

By knowing the hazard function, we can also calculate many probabilities of interest.  For instance, if a2>a1:

  • $P(T \ge a_2 \mid T \ge a_1) = \exp\{-\int_{a_1}^{a_2} \lambda(s)\, ds\}$
  • $P(a_1 \le T \le a_2 \mid T \ge a_1) = 1 - \exp\{-\int_{a_1}^{a_2} \lambda(s)\, ds\}$

Wooldridge (2001) uses the example of recidivism.  Let λ(t) equal the hazard rate at which criminals freed from jail commit another crime.  The term λ(13) is then approximately the probability that a person is arrested in the 13th month after their release, conditional on not having been arrested during the first year.

Weibull Example

One example of a common hazard function is the Weibull function.  In the Weibull, we have:

  • $f(t) = \alpha \gamma t^{\alpha-1} \exp\{-\gamma t^{\alpha}\}$
  • $\lambda(t) = \gamma \alpha t^{\alpha-1}$
  • $S(t) = \exp\{-\gamma t^{\alpha}\}$

The Weibull is an attractive model because the hazard rate need not be constant over time. Also, the Weibull distribution is simple to understand. The parameter α determines its shape and the parameter γ determines its scale. Further, if the hazard rate depends on individual characteristics, we can condition the value of γ on a vector of covariates.  If α=1, then the Weibull simplifies to the exponential distribution. This is a “memoryless” distribution where the hazard rate is constant over time (i.e., λ(t)=γ).
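
The three Weibull formulas can be checked against each other numerically. The sketch below uses the parametrization from the formulas above, with α as the exponent on t and γ as the multiplicative constant; the parameter values are arbitrary choices for illustration.

```python
import math


def weibull_hazard(t, alpha, gamma):
    return gamma * alpha * t ** (alpha - 1)


def weibull_survival(t, alpha, gamma):
    return math.exp(-gamma * t ** alpha)


def weibull_pdf(t, alpha, gamma):
    return alpha * gamma * t ** (alpha - 1) * math.exp(-gamma * t ** alpha)


alpha, gamma = 1.5, 0.1   # arbitrary illustrative values
t = 2.0

# Identity f(t) = S(t) * lambda(t).
pdf_direct = weibull_pdf(t, alpha, gamma)
pdf_from_hazard = weibull_survival(t, alpha, gamma) * weibull_hazard(t, alpha, gamma)

# Conditional survival P(T >= a2 | T >= a1) = S(a2) / S(a1).
a1, a2 = 1.0, 3.0
cond_survival = weibull_survival(a2, alpha, gamma) / weibull_survival(a1, alpha, gamma)

# With alpha = 1 the hazard is the same at every t (exponential case).
exp_hazards = [weibull_hazard(s, 1.0, gamma) for s in (0.5, 1.0, 10.0)]

print(pdf_direct, pdf_from_hazard, cond_survival, exp_hazards)
```

For the Weibull the integral of the hazard from a1 to a2 is γ(a2^α − a1^α), so the conditional survival probability computed from S(a2)/S(a1) matches the exponential-of-integrated-hazard formula given earlier.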

Proportional Hazard Models

Often, you will want to see how different covariates affect the hazard rate.  A very simple model to use is the proportional hazard model.  Here, the baseline hazard λ0(t) is common to all individuals, and covariates have a multiplicative effect on this baseline hazard.  For instance:

  • $\lambda(t; x) = \kappa(x) \lambda_0(t)$
    • $\kappa(x) = \exp\{\beta X\}$
  • $\ln \lambda(t; x) = \beta X + \ln \lambda_0(t)$
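
A small sketch may help show what "proportional" means here. The Weibull baseline, the coefficient values and the covariate vectors below are all hypothetical; the point is only that the ratio of two individuals' hazards does not depend on t.

```python
import math


def baseline_hazard(t, alpha=1.5, gamma=0.1):
    """Hypothetical Weibull baseline hazard, shared by all individuals."""
    return gamma * alpha * t ** (alpha - 1)


def hazard(t, x, beta):
    """Proportional hazard: exp(beta . x) times the baseline hazard."""
    kappa = math.exp(sum(b * xi for b, xi in zip(beta, x)))
    return kappa * baseline_hazard(t)


beta = [0.5, -0.2]   # hypothetical coefficients
x_a = [1.0, 3.0]     # individual A
x_b = [0.0, 3.0]     # individual B differs only in the first covariate

ratio_early = hazard(1.0, x_a, beta) / hazard(1.0, x_b, beta)
ratio_late = hazard(10.0, x_a, beta) / hazard(10.0, x_b, beta)
print(ratio_early, ratio_late, math.exp(0.5))
```

Both ratios equal exp(0.5): a one-unit difference in the first covariate scales the hazard by exp(β₁) at every point in time, which is why exponentiated coefficients are read as hazard ratios.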

Let us assume that our null hypothesis is that when someone is sick, it is not swine flu.  A type I error is a false positive.  That is, we claim that the person has the swine flu, when actually they do not.  A type II error is a false negative.  This means that the person has swine flu, but we erroneously conclude that they do not.

What is the probability that someone who has flu-like symptoms actually has swine flu?  We can calculate this using Bayes Rule:

  • P(H1N1|symptoms) = P(symptoms|H1N1)*P(H1N1)/P(symptoms)

Let us assume that all individuals with swine flu have symptoms so that P(symptoms|H1N1)=1.  Let us assume 2% of the population gets any type of flu each year and displays symptoms.  Let us assume only .02% of the population gets H1N1.  So, P(symptoms)=0.02 and P(H1N1)=.0002.  Thus we have:

  • P(H1N1|symptoms) = 1*0.0002/0.02=.01.

This means that if we see a random person with flu-like symptoms, there is only a 1% chance that they actually have the swine flu.
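
The calculation is a direct application of Bayes' rule, P(A|B) = P(B|A)·P(A)/P(B), and can be sketched as follows. The function name is illustrative; the probabilities are the assumed values from the example.

```python
def posterior(p_evidence_given_h, p_h, p_evidence):
    """Bayes' rule: P(H | evidence) = P(evidence | H) * P(H) / P(evidence)."""
    return p_evidence_given_h * p_h / p_evidence


p_symptoms_given_h1n1 = 1.0   # everyone with H1N1 shows symptoms
p_h1n1 = 0.0002               # 0.02% of the population gets H1N1
p_symptoms = 0.02             # 2% of the population shows flu symptoms

p_h1n1_given_symptoms = posterior(p_symptoms_given_h1n1, p_h1n1, p_symptoms)
print(round(p_h1n1_given_symptoms, 4))  # about 0.01
```

The result is small because the prior P(H1N1) is a hundred times smaller than P(symptoms), even though every H1N1 case is symptomatic.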

This may explain why the CDC and WHO ignored early warnings from a Washington-based biosurveillance company concerning a possible flu outbreak.  Although there was an increase in the number of cases of influenza, the probability that it was an outbreak of H1N1 (or any type of outbreak) was low.  Although the probability of a false positive was high, the cost of a false negative is also large.  Ex-post, it is obvious that the CDC and WHO should have acted more quickly to fight the spread of H1N1.  Ex-ante, these organizations likely receive numerous reports of potential outbreaks, and acting on every single one (most of which turn out to be false) would be very costly.  Identifying the optimal time to initiate school closings and public health warnings is very difficult and must take into account both the probabilities and the costs of type I and type II errors.

Regression Discontinuity is an econometric method that has become popular in recent years.  Let me give you an example where regression discontinuity would be valid.  

Let us say that all students who score 1000 or more on their SATs matriculate at Ivy U and all students who score below 1000 attend college at State U.  The research question is what impact going to Ivy U has on wages.

If we simply compare the average salaries of those at Ivy U and those at State U, this will likely not reveal the true effect that Ivy U had on its graduates.  Those at Ivy U were likely smarter and more motivated than those at State U.  Thus, the impact of Ivy U’s education is confounded with the individual’s own talent and motivation.  

Regression discontinuity, however, can solve this problem.  If we compare individuals who scored just above and just below 1000, these individuals are likely very similar in terms of intelligence and motivation.  The only difference would be the impact of Ivy U’s education and networking possibilities against State U’s.  We could compare average wages for those who scored just above and just below the 1000 mark.  However, we could also regress wages on a polynomial function of test scores with a discrete jump term at 1000.  Mathematically, this means the following:

  • Effect = $\lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]$
  • In this example, $Y_i$ is wages, $X_i$ is the test score, and the cutoff value, c, is 1000.
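
A minimal numerical sketch of this estimator, assuming noise-free made-up data: wages rise linearly in the SAT score and jump by 5 at the cutoff. Separate lines are fitted on each side of the cutoff, and the effect is the gap between their predicted values at c. All numbers and variable names are hypothetical.

```python
import numpy as np

cutoff = 1000
scores = np.arange(900, 1101, 10)   # hypothetical SAT scores around the cutoff
treated = scores >= cutoff          # True if the student attends Ivy U

# Hypothetical wage process: smooth linear trend plus a jump of 5 at the cutoff.
true_jump = 5.0
wages = 20.0 + 0.02 * (scores - cutoff) + true_jump * treated

# Fit one line on each side, with scores centered at the cutoff so that
# each intercept is the predicted wage exactly at the cutoff.
left_slope, left_at_cutoff = np.polyfit(scores[~treated] - cutoff, wages[~treated], 1)
right_slope, right_at_cutoff = np.polyfit(scores[treated] - cutoff, wages[treated], 1)

effect = right_at_cutoff - left_at_cutoff
print(round(effect, 3))
```

With real data one would add noise, choose a bandwidth around the cutoff, and possibly use higher-order polynomial terms; this sketch only shows how the two one-sided limits in the formula translate into two fitted intercepts.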

Can we use Regression Discontinuity to estimate the impact of school districts on schooling?  We could compare houses on each side of the school district boundaries and then see if these similar houses have different test scores.  However, this will likely not produce reliable results if parents choose their house based on the school district.  Thus, even if two identical houses are right next to each other, if high achieving parents always choose the better school district, then there will be perfect sorting between school districts.

David S. Lee and Thomas Lemieux (2009)  have a great “user guide” about how to use Regression Discontinuity in practice.  Some of their top tips are the following:

RD designs can be invalid if individuals can precisely manipulate the “forcing variable”. 

  • In the school district choice example, where parents can precisely choose their school district, RD may be invalid.

If individuals – even while having some influence – are unable to precisely manipulate the forcing variable, a consequence of this is that the variation in treatment near the threshold is randomized as though from a randomized experiment. 

  • Intuitively, when individuals have imprecise control over the forcing variable, even if some are especially likely to have values of X near the cutoff, every individual will have approximately the same probability of having an X that is just above (receiving the treatment) or just below (being denied the treatment) the cutoff – similar to a coin-flip experiment.  This is the case of people who score around 1000 on the SAT and thus have an approximately equal probability of getting into Ivy U or State U.

RD designs can be analyzed – and tested – like randomized experiments.

  • If variation in the treatment near the threshold is approximately randomized, then it follows that all “baseline characteristics” – all those variables determined prior to the realization of the forcing variable – should have the same distribution just above and just below the cutoff.

Non-parametric estimation does not represent a “solution” to functional form issues raised by RD designs. It is therefore helpful to view it as a complement to – rather than a substitute for – parametric estimation.

  • Parametric functions are what are traditionally used.  These are generally polynomials in X on which the dependent variable of interest is regressed.  In my example, this would be a polynomial regression with future wages as the dependent variable and test scores as the independent X variable.  Non-parametric estimation techniques include local linear regression.

Goodness-of-fit and other statistical tests can help rule out overly restrictive specifications.

  • Although there is no simple formula that works in all situations and contexts for weeding out inappropriate specifications, it seems reasonable, at a minimum, not to rely on an estimate resulting from a specification that can be rejected by the data when tested against a strictly more flexible specification. For example, it seems wise to place less confidence in results from a low-order polynomial model, when it is rejected in favor of a less restrictive model (e.g., separate means for each discrete value of X). Similarly, there seems little reason to prefer a specification that uses all the data, if using the same specification but restricting to observations closer to the threshold gives a substantially (and statistically) different answer. 

Most people know that, under the central limit theorem, the distribution of the sample mean approaches a normal distribution as the number of observations gets large.  The question is: if we have a series of discrete events and want to approximate the distribution of their mean with a continuous distribution, should we use a normal distribution?

For instance, let us assume we have 20 observations on patient admittance to the hospital and in 3 of those cases, the individual died.  We can use a binomial distribution to model the probability of observing r deaths among n patients:

  • $\binom{n}{r} \pi^r (1-\pi)^{n-r}$

We can estimate π with 3/20 = 0.15.  For our prior distribution, we could fit a normal distribution.  Using a normal distribution, however, would include values less than 0.  This is especially problematic if the sample size is small (e.g., n=20).  A truncated normal would solve the problem of negative values, but eliminating one portion of the distribution will change the distribution’s mean and variance.

Another option is to use the beta distribution for the prior.  The beta distribution for the value of π is:

  • $p(\pi) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi^{\alpha-1} (1-\pi)^{\beta-1}$

If we apply Bayes’ theorem to the binomial data with a beta prior, we get:

  • $p(\pi \mid r, n) \propto \pi^{r} (1-\pi)^{n-r} \pi^{\alpha-1} (1-\pi)^{\beta-1}$
  • $p(\pi \mid r, n) \propto \pi^{\alpha+r-1} (1-\pi)^{\beta+n-r-1}$

Now we have that the posterior distribution is Beta(α+r, β+n-r), since a Beta(a, b) density is proportional to $\pi^{a-1}(1-\pi)^{b-1}$.  We already know r and n, and can match α and β with the method of moments.

  • $E(\theta) = \alpha/(\alpha + \beta)$
  • $\mathrm{var}(\theta) = \alpha\beta/[(\alpha + \beta)^2 (\alpha + \beta + 1)]$

Now we estimate E(θ) and var(θ) with the sample moments, where θ denotes the same probability written as π above. If 3/20 people died, then we estimate E(θ) with 3/20 = 0.15. Further, with a binomial distribution, we can estimate var(θ) with p(1-p)/n = .15*.85/20 = .00638. This means that s(θ) = $.00638^{1/2}$ = .07984. Thus we can solve for α and β since we now have two equations and two unknowns.
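
Solving the two moment equations is a short calculation: writing m for the mean and v for the variance, one gets α + β = m(1-m)/v - 1, then α = m(α + β) and β = (1-m)(α + β). A sketch, with an illustrative function name:

```python
def beta_from_moments(mean, var):
    """Method-of-moments match of a Beta(alpha, beta) to a given mean and variance."""
    total = mean * (1.0 - mean) / var - 1.0   # this is alpha + beta
    return mean * total, (1.0 - mean) * total


r, n = 3, 20
p_hat = r / n                         # 0.15
var_hat = p_hat * (1.0 - p_hat) / n   # 0.006375

alpha, beta = beta_from_moments(p_hat, var_hat)
print(round(alpha, 2), round(beta, 2))

# Sanity check: the fitted Beta reproduces the two target moments.
fitted_mean = alpha / (alpha + beta)
fitted_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
```

With the variance estimated as p(1-p)/n, the sum α + β always equals n - 1, so here α is about 2.85 and β about 16.15.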

How do you estimate the specific effect smoking has on the probability of being hospitalized?  If smokers on average have lower income and less educational achievement, is smoking truly causing the increase in hospitalization or could the covariates fully or partially explain the increased hospitalization rates?

A paper by Kleinman and Norton suggests using adjusted risk ratios with logistic regressions.  The formula for this procedure is as follows:

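  • $\mathrm{ARR} = \left[\frac{1}{N}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if exposed})\right] \div \left[\frac{1}{N}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if unexposed})\right]$   (1)
  • $\mathrm{ARD} = \left[\frac{1}{N}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if exposed})\right] - \left[\frac{1}{N}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if unexposed})\right]$   (2)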

The authors explain the first equation as follows:

  • “The denominator of equation (1) is the mean of this calculated risk for each observation when the exposure variable is assumed to be unexposed and represents an MLE of the unexposed (baseline) risk for a population whose covariates are distributed as for the observed covariates for the entire study population. The numerator in equation (1) represents an MLE of the adjusted risk among the exposed. This approach is a specific example of using what are called “recycled predictions.”

Standard errors can be calculated using either bootstrapping or the Delta Method.  However, the authors wisely recommend bootstrapping the standard errors since it reduces the computational resources needed and can also allow for asymmetric confidence intervals.
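
The "recycled predictions" idea can be sketched without any statistics library. The logistic coefficients and the covariate values below are hypothetical stand-ins for a fitted model; they are not taken from the paper. For each person we predict the risk twice, once with the exposure switched on and once with it switched off, keeping that person's other covariates fixed.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted logistic regression: intercept, exposure (smoker), age.
intercept, b_exposure, b_age = -3.0, 0.7, 0.03

# Hypothetical ages of the study population (the observed covariates).
ages = [30, 45, 60, 75]


def mean_predicted_risk(exposed):
    """Average predicted risk with exposure set to `exposed` for everyone."""
    risks = [logistic(intercept + b_exposure * exposed + b_age * age) for age in ages]
    return sum(risks) / len(risks)


risk_exposed = mean_predicted_risk(1)
risk_unexposed = mean_predicted_risk(0)

arr = risk_exposed / risk_unexposed   # adjusted risk ratio, equation (1)
ard = risk_exposed - risk_unexposed   # adjusted risk difference, equation (2)
print(round(arr, 3), round(ard, 3), round(math.exp(b_exposure), 3))
```

The last printed value is the odds ratio implied by the exposure coefficient. The adjusted risk ratio is smaller than the odds ratio here, which is one reason to report risk ratios rather than reading odds ratios as if they were relative risks.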

Let us say you have 10 observations of 2 different variables.  How do you determine which of the observations to use?  Should you throw out the outliers?  Should you only include the most similar values?  Do more observations increase or decrease the amount of measurement error?

These questions can be answered by the discipline of Statistics.  An interesting book by Stigler recounts The History of Statistics.  Astronomers led many of the statistical advances in the seventeenth and eighteenth centuries.  Accurate measurement is very important to astronomers.  Further, observations with respect to the circumference and oblateness of the earth were made at different times and places throughout history.  This leaves a conundrum of how best to combine these observations.

Mayer, Boscovich, and others contributed to the development of the idea of least squares, but Stigler credits Legendre with the invention of least squares.  Legendre came up with the idea in his attempt to measure the length of the meridian quadrant (the distance from the equator to the North Pole) through Paris.

To demonstrate some of his ideas, I will use a simpler example.  Let us assume that a drug can have a dosage level between 0 and 5 and we want to find its impact on health (measured on a 0-10 scale).  Let us look at the following data.  The goal is to find the parameters m (slope) and b (intercept) that accurately measure the relationship between drug dosage and health (ignore any questions of endogeneity).  Should we include all 10 observations?

Although Euler recognized that including more observations increases the maximum possible error, Legendre realized that adding more observations also greatly increased the probability of getting close to the true value of the parameters of interest.  

In my example, we need to fit a line to measure the parameters m and b.  How do we set up the errors so that we have the most accurate calculations?  Laplace believed that the following two conditions would need to hold:

  1. $\sum_i \text{Dosage}_i \cdot e_i = 0$
  2. $\sum_i |\text{Dosage}_i \cdot e_i| = \text{minimum}$

The first condition basically says that the errors are uncorrelated with the independent variables on average.  The second condition hopes to minimize the errors.  Legendre extended Laplace’s second condition to minimize the sum of the squared errors rather than just the absolute error level.

Another key point is that this regression line must go through the “center of gravity.”  In my example, the average dosage for the ten observations is 2.2 and the average health level is 5.9.  This means the center of gravity is at the coordinates (2.2, 5.9).  The solution in my example is to set m=1.1456 and b=3.3797.  We see that if we plug 2.2 into the equation, the output is 5.9; thus, the regression line does indeed go through the center of gravity.
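
The center-of-gravity property holds for any least squares line with an intercept, and it is easy to verify numerically. The ten dosage and health values below are hypothetical (they are not the data behind the m and b reported above), so the fitted coefficients will differ; only the property itself carries over.

```python
import numpy as np

# Hypothetical data: 10 observations of dosage (0-5) and health (0-10).
dosage = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
health = np.array([3.0, 4.0, 5.0, 4.0, 6.0, 6.0, 7.0, 8.0, 7.0, 9.0])

# Least squares fit of health = m * dosage + b.
m, b = np.polyfit(dosage, health, 1)
residuals = health - (m * dosage + b)

# The fitted line evaluated at the mean dosage returns the mean health level.
prediction_at_mean = m * dosage.mean() + b

print(round(m, 4), round(b, 4))
print(round(prediction_at_mean, 4), round(health.mean(), 4))
print(round(float(np.sum(dosage * residuals)), 8))
```

The last line prints the sum of dosage times residual, which is zero up to rounding error: the least squares solution makes the errors uncorrelated with the regressor, matching the first condition listed above.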

Understanding the historical development of modern statistical techniques is an interesting task, and Stigler’s book enlightens the reader with much detail.

ANOVA

Let us say that you are a hospital administrator.  You are very clever and have come up with a system to score the quality of the work done by the physicians at your hospital.  To simplify things, let’s assume that you only have 3 physicians who work at your hospital.  The physicians’ scores are as follows:

  • Dr. Albert: 76, 85, 91, 67, 73 
  • Dr. Burns: 92, 90, 60, 79, 75
  • Dr. Collin: 50, 80, 83, 80, 74

The average score for Dr. Albert is 78.4, for Dr. Burns is 79.2 and for Dr. Collin is 73.4.  As the hospital administrator, you want to know whether these differences are due to differences in doctor quality or are likely due to random chance.  If there were only two doctors, a t-test would suffice, but what test can you use in the case of multiple doctors?

The solution to this is to run an ANOVA test.  How do we do this?  Follow these easy steps.

  1. Let j be the group (j=a, b, c) and i be the observation number within each group (i=1, 2,…,5).
  2. Calculate the mean of each group (μ_j): μ_a = 78.4; μ_b = 79.2; μ_c = 73.4.
  3. Also calculate the mean of the entire sample: μ = 77.
  4. Now calculate the Sum of Squares within each group [SS_within = Σ_j Σ_i (X_ij - μ_j)^2].  This shows how much variation there is for each doctor.
    • SS_a = (76 - 78.4)^2 + (85 - 78.4)^2 + (91 - 78.4)^2 + (67 - 78.4)^2 + (73 - 78.4)^2 = 367.2
    • SS_b = 666.8
    • SS_c = 727.2
    • SS_within = SS_a + SS_b + SS_c = 1761.2
  5. Now calculate the Sum of Squares between groups [SS_between = Σ_j n_j (μ_j - μ)^2].  This shows how much variation there is across the doctors’ average scores.
    • SS_between = 5*(78.4 - 77)^2 + 5*(79.2 - 77)^2 + 5*(73.4 - 77)^2 = 98.8
  6. The F-statistic is the ratio of the between-group mean square (MS) to the within-group mean square.  How do we go from the SS to the MS?  That’s easy, we just divide each by its degrees of freedom.
    • MS_within = SS_within/(N-J) = SS_within/12.  This is because there are 15 observations and 3 doctors, so 15-3=12.  Our answer here is: 1761.2/12 = 146.77
    • MS_between = SS_between/(J-1) = SS_between/2.  This is because there are 3 doctors, so 3-1=2.  Our answer here is: 98.8/2 = 49.4.
  7. Now we can calculate the F statistic as: F = MS_between/MS_within = 49.4/146.77 = .337
  8. If we look this up on a chart for F-statistics with 2 and 12 degrees of freedom, we see that the p-value is .721: if all 3 doctors were equally good, we would see an F-statistic at least this large about 72% of the time.  Thus, we fail to reject the null that all three doctors are equally good.
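The steps above can be reproduced with a few lines of Python. This is only a sketch using the standard library. The p-value line relies on a closed form for the upper tail of the F distribution that holds only when the numerator has 2 degrees of freedom (as it does here with 3 doctors); for other designs you would want a statistics package instead.

```python
scores = {
    "Albert": [76, 85, 91, 67, 73],
    "Burns": [92, 90, 60, 79, 75],
    "Collin": [50, 80, 83, 80, 74],
}

def mean(values):
    return sum(values) / len(values)

all_scores = [s for group in scores.values() for s in group]
grand_mean = mean(all_scores)
n_obs = len(all_scores)    # N = 15
n_groups = len(scores)     # J = 3

# Step 4: variation of each doctor's scores around that doctor's own mean.
ss_within = sum((x - mean(g)) ** 2 for g in scores.values() for x in g)
# Step 5: variation of the doctors' means around the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in scores.values())

# Step 6: divide each sum of squares by its degrees of freedom.
df_between = n_groups - 1
df_within = n_obs - n_groups
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# Step 7: the F statistic.
f_stat = ms_between / ms_within

# Step 8: upper-tail probability. This closed form is valid ONLY for df_between == 2.
p_value = (1 + df_between * f_stat / df_within) ** (-df_within / 2)

print("SS within:", round(ss_within, 1), "SS between:", round(ss_between, 1))
print("F:", round(f_stat, 3), "p-value:", round(p_value, 3))
```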

STATA

Is there an easier way to do this?  Yes.  If you have Stata, you could just use the score as the dependent variable and have dummy variables for Drs. A, B, and C.  Then you can run a statistical test that the coefficient estimate for Dr. A = the coefficient estimate for Dr. B = the coefficient estimate for Dr. C.  This will give you the same p-value that we calculated manually above.

Nobel laureate James Heckman has a nice summary of how applied econometricians and policy researchers should define causality. Some of the more interesting points I have excerpted below.

On the source of randomness in a sample

One reason why many statistical models are incomplete is that they do not specify the sources of randomness generating variability among agents, i.e., they do not specify why otherwise observationally identical people make different choices and have different outcomes given the same choice. They do not distinguish what is in the agent’s information set from what is in the observing statistician’s information set, although the distinction is fundamental in justifying the properties of any estimator for solving selection and evaluation problems. They do not distinguish uncertainty from the point of view of the agent whose behavior is being analyzed from variability as analyzed by the observing analyst. They are also incomplete because they are recursive. They do not allow for simultaneity in choices of outcomes of treatment that are at the heart of game theory and models of social interactions and contagion (see, e.g., Brock & Durlauf, 2001; Tamer, 2003).

Unbundling a treatment

Researchers often say that a policy change will cause a change in some outcome measure. However, a policy change is often made up of many components. Which components of the policy change actually influenced the outcomes? In Heckman’s words:

Many causal models in statistics are black-box devices designed to investigate the impact of “treatments”—often complex packages of interventions—on observed outcomes in a given environment. Unbundling the components of complex treatments is rarely done. Explicit scientific models go into the black box to explore the mechanism(s) producing the effects.

Outcomes vs. Utilities

Most researchers pick an outcome variable of interest and conclude that if the outcome increases (assuming a beneficial outcome measure) then people are better off. This may not be the case, however. For instance, Bill Clinton’s welfare reform act (PRWORA) may have increased employment rates and income for single mothers, but the mothers’ utility may have decreased. The single mothers may (or may not) have valued spending time caring for their children more than working.

Problems with non-linearity

Issues such as “social interactions, contagion and general equilibrium effects” can complicate causal inference.

What are you measuring?

Let us assume that Y is the outcome variable of interest. Y depends on what state, s, you are in. For instance, in a treatment/no treatment world, Y(s) is the outcome if you are treated and Y(s') is the outcome if you are not treated. D(s)=1 if you were actually treated and D(s)=0 if you did not receive treatment in the data. Thus, we can measure various things:

  • Average Treatment Effect (ATE): E[Y(s) − Y(s')]. This is equal to the average effect if all individuals moved from an untreated to a treated state.
  • Treatment on the Treated (TT): E[(Y(s) − Y(s')) | D(s) = 1]. This looks at the average effect of treatment only on those who were treated. This is important if only certain individuals select into the treatment group, or if the policy change is only relevant for certain individuals.
  • Treatment on the Untreated (TUT): E[(Y(s) − Y(s')) | D(s) = 0]. This is the average effect that treatment would have had on those who were not treated. Separately, a treatment can also spill over onto people who never receive it. For instance, instituting a work training program for treated individuals may reduce community college enrollment and thus may affect untreated individuals (e.g., if the community college closes from lack of enrollment).
  • Policy relevant treatment effect (PRTE): E_p[Y(s)] − E_p'[Y(s)]. This parameter compares the average outcomes under two different policy choices, p and p'.
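A simulation can make the difference between ATE, TT and TUT concrete. The sketch below is not from Heckman's paper; the distributions and the selection rule are assumptions chosen purely for illustration. Because we generate both potential outcomes for everyone, we can compute each parameter directly, which is of course impossible with real data.

```python
import random

random.seed(0)

people = []
for _ in range(10_000):
    gain = random.gauss(1.0, 2.0)          # hypothetical individual gain Y(s) - Y(s')
    y_untreated = random.gauss(5.0, 1.0)   # Y(s')
    y_treated = y_untreated + gain         # Y(s)
    # Assumed selection on gains: people who benefit more are more likely to be treated.
    treated = gain + random.gauss(0.0, 1.0) > 1.0
    people.append((y_treated, y_untreated, treated))

def avg(values):
    return sum(values) / len(values)

gains_all = [y1 - y0 for y1, y0, d in people]
gains_treated = [y1 - y0 for y1, y0, d in people if d]
gains_untreated = [y1 - y0 for y1, y0, d in people if not d]

ate = avg(gains_all)        # E[Y(s) - Y(s')]
tt = avg(gains_treated)     # E[Y(s) - Y(s') | D(s) = 1]
tut = avg(gains_untreated)  # E[Y(s) - Y(s') | D(s) = 0]
share_treated = len(gains_treated) / len(people)

print("ATE:", round(ate, 2), "TT:", round(tt, 2), "TUT:", round(tut, 2))
```

With selection on gains, TT exceeds ATE, which exceeds TUT, and ATE is the treated-share-weighted average of the other two.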

Heckman, James (2008) “Econometric Causality” NBER WP #13934.

Much of health care data is characterized by a large cluster of data at 0, and a right skewed distribution of the remaining outcomes. For instance, people who do not get sick generally use $0 of medical care. Those who do get sick use a varying amount of medical care dollars, but there are a large number of outliers with extremely expensive medical care. How do health economists take these anomalies into account?

David Madden looks at two alternatives to correct for the shape of the distribution in his 2008 JHE paper: sample selection and two-part models. Zero consumption of medical care can result from two different decisions: a participation decision and a consumption decision. For instance, in the case of smoking, individuals may decide not to smoke no matter how cheap cigarettes get (participation decision). On the other hand, some smokers may decide not to smoke during a given time period because cigarettes are very expensive or they have low income (consumption decision). Since people cannot smoke negative cigarettes, there still may be a cluster of observations at zero.

Assume that an individual’s utility from participation is equal to w=α’Z + v. If w>0, then d=1 (the individual participates), and if w≤0, then d=0 (the individual does not participate). For consumption, individuals will choose y**=max[0,y*]; y*= β’X + u. A general model can be written as follows, where the first product runs over observations with zero consumption and the second over observations with positive consumption:

  • L_0 = Π_0 [1 - P(v>-α'Z) P(u>-β'X | v>-α'Z)] Π_+ P(v>-α'Z) P(u>-β'X | v>-α'Z) g(y | v>-α'Z, u>-β'X)

If u and v are independent, then we have the Cragg model:

  • L_1 = Π_0 [1 - P(v>-α'Z) P(u>-β'X)] Π_+ P(v>-α'Z) P(u>-β'X) g(y | u>-β'X)

If we assume that the participation constraint dominates the consumption constraint (which is likely in the smoking example, but maybe not for drinking), then we have P(y*>0|d=1)=1 and g(y*|y*>0,d=1)=g(y*|d=1). This means that if you are a smoker you will have at least one cigarette per period. When the participation constraint dominates, we ignore the consumption decision and we have the following likelihood function which corresponds to the Heckman Selection model.

  • L_2 = Π_0 [1 - P(v>-α'Z)] Π_+ P(v>-α'Z) g(y | v>-α'Z)

If independence is assumed, then we are left with probit for participation and OLS for consumption. This is the two part model:

  • L_3 = Π_0 [1 - P(v>-α'Z)] Π_+ P(v>-α'Z) g(y)

Which of these models works best empirically?

Results

Madden looks at the fit of regressions trying to model smoking and drinking behavior using a wide variety of covariates. In general, the two-part model seems to perform better in the data used for this study, but the author wisely notes that deciding between the Heckman selection and the two-part model should be done on a case-by-case basis.

Let us assume that there are two types of people: smart people and dumb people. Smart people’s test scores are normally distributed around 80% and dumb people’s test scores are normally distributed around 40%. If we observe the test score of one person, how do we know if they are smart or dumb? If we see a score of 85%, we are pretty sure they are smart. A dumb person might have had a good day, but this would be a low probability event. Similarly, if we saw a score of 35%, we would be fairly certain that the person is dumb, even though there is a small probability that a smart person may have had a bad day. If we see a score of 62%, however, then it is very difficult to distinguish if the person is smart or dumb. But how can we quantify the probability that a person is of a certain type?

One way of doing this is finite mixture models. Jim Hamilton’s Time Series Analysis book has a good explanation of this topic and I will review this material here.

Each type (e.g.: how smart the person is) will be designated as s_t=1,2,…, or N. Let us assume that there is an observed variable y_t (e.g.: the test score) which, for a person of type j, is distributed N(μ_j, σ_j^2). What the researcher wants to know is: given that we observe y_t, what is the probability that the observation is from a person of type s_t=j?

Let us assume that we know the density of y_t is:

  • f(y_t|s_t=j;θ) = (2πσ_j^2)^(-1/2) * exp{-(y_t - μ_j)^2/(2σ_j^2)}

There is also some underlying distribution of types.

  • P(s_t=j;θ) = λ_j
  • θ = (μ_1,…,μ_N, σ_1^2,…,σ_N^2, λ_1,…,λ_N)

From the definition of conditional probability, we know that:

  • P(A and B)=P(A|B)*P(B), which implies
  • f(y_t,s_t=j;θ) = λ_j*(2πσ_j^2)^(-1/2) * exp{-(y_t - μ_j)^2/(2σ_j^2)}

The unconditional density can be found as follows:

  • f(y_t;θ) = Σ_{j=1 to N} f(y_t,s_t=j;θ)
  • f(y_t;θ) = λ_1*(2πσ_1^2)^(-1/2) * exp{-(y_t - μ_1)^2/(2σ_1^2)} +…+ λ_N*(2πσ_N^2)^(-1/2) * exp{-(y_t - μ_N)^2/(2σ_N^2)}

Now we can use maximum likelihood estimation techniques to find the θ which will maximize:

  • max_θ L(θ) = Σ_{t=1 to T} log f(y_t;θ)
  • s.t.: λ_1 + λ_2 +…+ λ_N = 1
  • s.t.: λ_j ≥ 0

Once we have the MLE estimate of θ, we can figure out the probability that observation y_t came from a person of type s_t=j. Using Bayes’ rule, we know that:

  • P(s_t=j|y_t;θ) = f(y_t,s_t=j;θ)/f(y_t;θ) = λ_j*f(y_t|s_t=j;θ)/f(y_t;θ)

This value represents the probability, given the observed data, that the unobserved type responsible for observation t was type j. For example, “…if an observation y_t=0, one could be virtually certain that the observation had come from a N(0,1) distribution rather than a N(4,1) distribution, so that P(s_t=1|y_t;θ) for that date would be near unity. If instead y_t were around 2.3, it is equally likely that the observation might have come from either regime so that P(s_t=1|y_t;θ) for such an observation would be close to 0.5.”

Most of the above content is from:

  • James D. Hamilton (1994) Time Series Analysis, Princeton University Press, Princeton, NJ; pp. 685-689.
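To make the posterior type probabilities concrete, here is a Python sketch applied to the smart/dumb example. The means (80 and 40) come from the example above; the standard deviation of 10 for both types and the 50/50 population shares are assumptions made only for illustration, and θ is treated as known rather than estimated by maximum likelihood.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_type_probs(y, mus, sigmas, lambdas):
    """Return P(s_t = j | y_t; theta) for each type j using Bayes' rule."""
    # Joint density of the observation and each type: lambda_j * f(y | s = j).
    joint = [lam * normal_pdf(y, mu, sigma) for mu, sigma, lam in zip(mus, sigmas, lambdas)]
    total = sum(joint)  # unconditional density f(y; theta)
    return [p / total for p in joint]

# Type 1 = smart, type 2 = dumb. Standard deviations and shares are assumed.
mus = [80.0, 40.0]
sigmas = [10.0, 10.0]
lambdas = [0.5, 0.5]

for score in (85, 62, 35):
    p_smart, p_dumb = posterior_type_probs(score, mus, sigmas, lambdas)
    print("score", score, "-> P(smart) =", round(p_smart, 3))
```

Under these assumed parameters the break-even score is exactly 60, halfway between the two means; unequal shares or variances would move it.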

What is the effect of a country’s GDP on health? What about the effect of the country’s literacy rate on infant mortality rates? Often researchers try to answer these questions using time-series data. With time series data, we have observations of a few units (e.g.: countries or individuals) over many years.

Let the subscript i represent the individual or country and the subscript t indicate the year. We can have a regression framework as follows:

  • y_it = β x_it + ε_it

As long as cov(x_it, ε_it)=0, then ordinary least squares (OLS) will provide an unbiased estimate of β.

One frequent problem which occurs with time series data is serial correlation. Serial correlation (or autocorrelation) occurs when the error terms are correlated over time. For instance,

  • ε_it = ρ ε_i,t-1 + ν_it, where ν_it is white noise

Serial correlation means that if your predicted y value is overestimated in one period, it is likely to be overestimated in the next period as well. This is likely due to some persistent variable omitted from the regression. For instance, if we regressed test scores on a vector of explanatory variables, it is likely that students who scored higher than their predicted test score in one period would also score higher than their predicted test score in another period.

Fortunately, our estimate of the coefficient vector (β) is still unbiased even in the presence of serial correlation. However, OLS is inefficient, and the usual standard errors are wrong; with positive serial correlation they are typically too small.

One way to test for serial correlation is to use the Durbin-Watson test. Let u_it be the residuals after we conduct an OLS regression (u_it = y_it - β_ols x_it).

The Durbin-Watson statistic is:

  • d = [Σ(t=2 to T) (u_it - u_i,t-1)^2] / [Σ(t=1 to T) (u_it)^2]

With panel data we have:

  • d = [Σ(i=1 to N)Σ(t=2 to T) (u_it - u_i,t-1)^2] / [Σ(i=1 to N)Σ(t=1 to T) (u_it)^2]

This page will help you interpret the statistic and decide whether or not to reject the null hypothesis of no serial correlation. If there is serial correlation in your data, you may want to include a lagged dependent variable as one of your right hand side variables. This will result in an AR(1) specification.
Yuting Wang of Notre Dame has a good explanation of the problems that occur with serial correlation.
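The Durbin-Watson statistic above is easy to compute by hand. The Python sketch below does so for a single unit's residuals and tries it on simulated AR(1) errors; the simulated series and the function names are assumptions for illustration only. A useful rule of thumb is that d is approximately 2(1 - ρ), so values near 2 suggest no first-order serial correlation and values well below 2 suggest positive serial correlation.

```python
import random

def durbin_watson(residuals):
    """d = sum_{t=2..T} (u_t - u_{t-1})^2 / sum_{t=1..T} u_t^2 for one unit's residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(u ** 2 for u in residuals)
    return num / den

def simulate_ar1(rho, n, seed):
    """Simulated errors following e_t = rho * e_{t-1} + white noise (hypothetical data)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0.0, 1.0))
    return errors

d_none = durbin_watson(simulate_ar1(rho=0.0, n=5000, seed=1))
d_positive = durbin_watson(simulate_ar1(rho=0.8, n=5000, seed=1))

print("d with rho = 0.0:", round(d_none, 2))
print("d with rho = 0.8:", round(d_positive, 2))
```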

Most public health officials believe that increasing the supply of primary care doctors is almost always a good thing, while increasing the number of specialists can have mixed results. One problem is that physician supply is endogenous. One may believe that physicians prefer to locate in wealthier areas. If wealthier people are also healthier, then a correlation will exist between physician supply and health even though no causality exists.

In order to isolate the direct causal effect of increasing family physician supply, Gravelle, Morris and Sutton (2008) use an instrumental variables methodology. The two instruments for physician supply are an index of local area housing prices and average age-related capitation payments. Since physicians’ location decisions are regulated by the Medical Practices Committee and physician pay does not include a cost-of-living adjustment, we would expect lower physician supply where housing prices are higher. Local area average capitation payments should not affect any individual’s health, but should attract increased family physician supply.

These instruments are implemented on the Health Survey for England data set. Physician supply comes from the General Medical Services (GMS) Statistics database.

Health levels are measured either on a five-point scale (very good, good, fair, bad, or very bad), in which case an ordered probit regression is used, or with the continuous EQ-5D health measure, in which case a least squares regression model is used. What are the results?

When no instruments are used FPs [family physicians] have a positive but statistically insignificant effect on health. When FP supply is instrumented by age-related capitation it has markedly larger and statistically significant effects. A 10 percent increase in FP supply increases the probability of reporting very good health by 6 percent.

Since almost all medical care and pharmaceuticals are free to patients, increased physician supply will not act to reduce prices. Nevertheless, more family physicians can make going to the doctor more convenient and can reduce waiting times, thus increasing the number of family physician visits per individual per year.

One interesting econometric technique used in this paper is that of the anti-test. A paper by Dranove and Wehner (1994) criticizes the use of instrumental variables because some instruments can be used to “prove” that increased physician supply “causes” increased childbirth. This is obviously a nonsensical result. In this paper, the authors use instrumented and noninstrumented family physician supply to see whether these variables have any effect on the individual’s ethnicity. Neither the instrumented nor the noninstrumented physician supply has any impact on ethnicity. Thus, we have some indication that the two instruments chosen by the authors are valid.

Randomized clinical trials (RCTs) are the “gold standard” for medical studies. Nevertheless, even RCTs have their problems. An NBER working paper by Ludwig, Marcotte and Norberg highlights some of these issues. The authors examine whether or not anti-depressants reduce suicide rates (they find that anti-depressants do reduce suicide rates).

Unfortunately, using data from RCTs will not give an accurate picture of an anti-depressant’s impact on suicide. For one, RCTs have relatively small sample sizes due to their expense. Since suicide occurs very infrequently, it will be difficult to pick up any statistically significant differences in suicide rates between the treatment and control groups. Secondly, people at high risk for suicide will likely be excluded from the RCT for ethical reasons. Thus, the RCT may have a sample which will under-represent individuals with suicidal tendencies.

Traditional instrumental variables (IV) econometric methodologies often fail to take into account response heterogeneity. Response heterogeneity based on characteristics not observed by the researcher can create heterogeneity in the self-selection process. For instance, one group of people who elect to receive surgery may have knowledge of a family history where surgery is typically successful, whereas another group may elect not to receive surgery due to a different family history. If this information is unobservable to the researcher, then an analysis of the average effect of surgery may be biased. In the medical context, traditional IV assumes that:

  1. treatment effects are constant conditional on observed characteristics, or
  2. if treatment effects are heterogeneous, patients or physicians cannot anticipate these effects and use this information to select the most beneficial treatment.

In traditional IV, the treatment parameter gives researchers a local average treatment effect (LATE). But can a researcher characterize a heterogeneous response using IV? A solution to this problem is presented by Basu, Heckman, Navarro-Lozano and Urzua in a 2007 Health Economics paper. They use a local IV to estimate marginal treatment effect (MTE) parameters.

Basic Econometrics Review

Let us assume that a person will have two different outcomes based on whether or not they are treated:

  • Y1 = μ1(X) + U1
  • Y0 = μ0(X) + U0
  • Δ = Y1 – Y0 = {μ1(X) -μ0(X)} + (U1 – U0)
  • Y=μ0(X)+D*{μ1(X)-μ0(X)} + {D(U1 – U0) + U0}

The variable Y1 represents the outcome if the person is treated and Y0 represents the outcome if they are not treated. We only have one observation per person, however, since we cannot observe the counterfactual. If we could observe the counterfactual, Δ would give us the effect of the treatment for each person. Unfortunately we only observe Y. The dummy variable D is equal to unity if the person is in the treatment group and zero otherwise. If there were a randomized trial where people are randomly placed into the treatment and control groups, it would be easy to estimate the treatment effect by comparing the mean outcomes of the treated and control groups. We could examine the mean outcomes for individuals with similar characteristics to determine the treatment parameter by subgroup. However, if individuals can select whether or not to be treated, the error term–which may be composed of unobserved heterogeneity in the effectiveness of the treatment–may be correlated with the regressors that impact the outcome.

The traditional solution to the endogeneity problem is IV. Let X be the set of regressors and Z represent the instruments. “LATE computes the mean gain to those induced to switch from no treatment to treatment by a change in Z from z to z‘.”

  • LATE={E(Y|X=x, Z=z‘)-E(Y|X=x, Z=z)} / {P(D=1| X=x, Z=z‘) – P(D=1| X=x, Z=z)}
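With a binary instrument, the LATE formula above is just a ratio of two differences in means. Here is a minimal Python sketch on a tiny made-up data set; the numbers are hypothetical and conditioning on X=x is dropped (think of everyone as sharing the same covariates). The value z=1 plays the role of z‘ and z=0 plays the role of z.

```python
def mean(values):
    return sum(values) / len(values)

def wald_late(y, d, z):
    """LATE for a binary instrument: change in mean outcome / change in treatment rate."""
    y_hi = [yi for yi, zi in zip(y, z) if zi == 1]
    y_lo = [yi for yi, zi in zip(y, z) if zi == 0]
    d_hi = [di for di, zi in zip(d, z) if zi == 1]
    d_lo = [di for di, zi in zip(d, z) if zi == 0]
    return (mean(y_hi) - mean(y_lo)) / (mean(d_hi) - mean(d_lo))

# Hypothetical data: z = instrument, d = treatment received, y = outcome.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 0, 1, 1, 1]
y = [10, 12, 11, 15, 11, 16, 14, 17]

late = wald_late(y, d, z)
print("LATE estimate:", late)
```

In this toy data the instrument raises the treatment rate from 25% to 75% and the mean outcome from 12 to 14.5, so the estimated gain for those induced to switch is 2.5/0.5.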

Marginal Treatment Effect (MTE)

Developed by Björklund and Moffitt (1987) and furthered by Heckman (1997), the MTE measures “the average gain to patients who are indifferent between receiving treatment 1 [the treatment] versus treatment 0 [the control] given X and Z.” The benefit of using MTE is that one can calculate the marginal treatment effect for different subgroups based on the propensity score. This places a high degree of reliance on the accuracy and precision of the propensity score in order to determine these subgroup treatment parameters.

Let V denote a latent variable which measures the difference in benefits from being in the treated and control groups. Treatment choice can be modeled as follows.

  • V= μv(Z,X) + Uv
  • E(Uv)=0
  • D=1(V>0)

The authors use a propensity score to determine the probability of selecting treatment.

  • P(z,x)=P(D=1|Z=z, X=x) = P(Uv > -μv(z,x)) = 1 – FUv(-μv(z,x))
  • FUv() is the cdf of Uv.

Now we can define MTE to be:

  • MTE(x,z)=E(Δ|X=x, Z=z, V=0)
  • =E(Δ|X=x, Z=z, Uv=-μv(z,x))
  • =μ1(x) -μ0(x) + E{U1 – U0|Uv=-μv(z,x)}
  • =μ1(x) -μ0(x) + E{U1 – U0|UD= FUv(-μv(z,x))}

where FUv(Uv)=UD. The last equation after the ‘|’ is a monotonic transformation of the terms after the ‘|’ in the third equation.

Local IV (LIV)

The LIV estimates the derivative of the expected outcome conditional on observed characteristics and the probability of electing to be in the treatment group, E(Y|X=x, P(z,x)), with respect to the probability of treatment, P(z,x). The term E(Y|X=x, P(z,x)) is defined as follows:

  • E(Y|X=x, P(z,x))=E{DY1 + (1-D)Y0 | X=x, P(Z,X)=P(z,x)}
  • =μ0(x) + P(z,x){μ1(x) -μ0(x)} + E{U0|P(Z,X)=P(z,x)} + P(z,x)E{U1-U0 | P(Z,X)=P(z,x), D=1}
  • =μ0(x) + P(z,x){μ1(x) -μ0(x)} + K(P(z,x))

The term K(P(z,x)) is a general function of the propensity score, P(z,x). Often, K() will be a polynomial of the propensity score. The MTE can be computed mathematically as below:

  • {∂E(Y|X=x, P(z,x)) / ∂P(z,x)} |1-P(z,x)=UD
  • = μ1(x) -μ0(x) + ∂K(P(z,x))/∂P(z,x)

The equation above “…is implemented by regressing the outcome Y on all covariates [X], the propensity score, the interaction of the propensity score with all covariates and a polynomial on the propensity score.” This procedure is carried out in the paper empirically by applying these methods to data on breast cancer patients and their choice of breast-conserving surgery with radiation compared to mastectomy.

Can we estimate risk aversion and prudence using a survey question for the general public? This is what a paper by Eisenhauer and Ventura attempts to do.

Methods

In the 1995 Survey of Italian Households’ Income and Wealth, one question asked:

You are offered the opportunity of acquiring a security permitting you, with the same probabilities, either to gain 10 million lire [5165€] or to lose all the capital invested. What is the most you are prepared to pay for this security?

Assuming the respondents answer honestly and precisely (which is a big assumption to make), the authors can write down an equation describing an individual’s utility function:

  • U(w)=0.5U(w-z)+0.5*U(w-z+10)

The variable w represents initial wealth and z is the amount the individual would pay for the security. Using a second-order Taylor expansion around w, we can create an estimate of absolute risk aversion.

  • 2U(w) = U(w) - zU'(w) + 0.5z^2 U''(w) + U(w) + (10-z)U'(w) + 0.5(10-z)^2 U''(w), or
  • [(50-10z+z^2)/(10-2z)]*U''(w) = -U'(w)
  • A(w) = -U''(w)/U'(w) = (10-2z)/(50-10z+z^2)
  • R(w) = A(w)*w

The term A(w) represents the Arrow-Pratt measure of absolute risk aversion while R(w) is equal to relative risk aversion. If we differentiate the second equation above with respect to initial income, w, we can calculate a measure of prudence (-U”’/U”).

  • η(w) = A(w) + {(10-z)^(-1) + [2z/(100+z^2)]}*∂z/∂w
  • ρ(w) = w*η(w)

The term η(w) measures absolute prudence while ρ(w) measures relative prudence.
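To get a feel for the risk aversion formula, the Python sketch below evaluates A(w) and R(w) for a few bids. Amounts are in millions of lire to match the 10 million lire payoff; the respondent's wealth of 100 million lire and the bids are hypothetical, not values from the survey.

```python
def absolute_risk_aversion(z):
    """A(w) = (10 - 2z) / (50 - 10z + z^2), with the bid z in millions of lire."""
    return (10 - 2 * z) / (50 - 10 * z + z ** 2)

def relative_risk_aversion(z, w):
    """R(w) = A(w) * w, with wealth w in the same units as z."""
    return absolute_risk_aversion(z) * w

# Hypothetical respondent with wealth of 100 million lire.
for bid in (1, 3, 5):
    a = absolute_risk_aversion(bid)
    r = relative_risk_aversion(bid, w=100)
    print("bid", bid, "-> A(w) =", round(a, 3), ", R(w) =", round(r, 1))
```

A bid of 5 million lire equals the expected payoff of the security, so it implies risk neutrality (A(w)=0); lower bids imply more risk aversion.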

Results

Since the authors have information regarding each individual’s initial earnings and various sociodemographic factors, they can analyze which type of people are risk averse.

  • Relative risk aversion is between 7.18 and 8.59.
  • Relative prudence is between 7.32 and 8.65.
  • The most risk averse groups are those in poor health and those with only an elementary school education.
  • The least risk averse are the college educated and those with health insurance.
  • Those with risky assets such as stocks or loans are less risk averse.
  • The authors claim that generally R(w)<ρ(w)<R(w)+1 and risk aversion and prudence are highly correlated.

Healthcare Economist critique

Finding that people are risk averse and prudent is unsurprising, but the levels of risk aversion and prudence are very high compared to other studies. While having a vast array of sociodemographic information is important, simply eliciting a willingness to pay for a risky gamble is likely not a precise estimate of risk aversion. Likely, most people will respond to the question categorically (5 million lire, 4.5 million lire, 4 million lire, etc.). Further, finding that people with health insurance are less risk averse is counter-intuitive. One explanation is that having health insurance may be a proxy for wealth. Thus people with health insurance in general could be more risk averse, but since this group of people is also richer (and more affluent people are generally less risk averse) we could have opposing effects.

Today I will review the insightful lecture of Willard Manning at European Science Days. Manning is most famous for his work with the RAND Health Insurance Experiment.

Problems with Healthcare Data

There are 4 major econometric problems one must consider when trying to analyze health care cost and utilization data:

  1. There is a large mass of individuals with zero utilization (or expenditures) during a given time period,
  2. Consumption among those with any care is very skewed (e.g.: visits, hospitalizations, expenditures),
  3. The dependent variable often responds in a non-linear manner to many covariates,
  4. Demand response to covariates may change by the level of demand (e.g.: outpatient to inpatient, or low to high levels).

Log or Box-Cox Transformations

While using OLS is easy, it can often produce out-of-range predictions (i.e.: yhat=xβhat<0). Since health care data is skewed, many researchers decide to log the dependent variable in order to have a more symmetric distribution of errors. The tradeoff of using logs is that although one gains precision and robustness, no one is interested in log-scale results per se.

The Box-Cox transformation of y is as follows:

  • [(y^λ - 1)/λ]=xβ+ε, if λ≠0
  • log(y)=xβ+ε, if λ=0

One estimates λ using MLE in order to minimize the skewness in the residuals.
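The transformation itself is simple to write down. The Python sketch below only applies the Box-Cox formula for a given λ (it does not estimate λ by MLE), and the function name is just an illustrative choice. Note that the transformation requires y>0, so the zero observations have to be handled separately.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # limiting case as lambda goes to 0
    return (y ** lam - 1) / lam

# lambda = 1 leaves the scale linear (up to a shift); lambda = 0 is the log transform.
for lam in (1, 0.5, 0):
    print("lambda =", lam, "->", round(box_cox(1000.0, lam), 3))
```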

Log Example

Using a log transformation implies that second moments often matter. For instance, let us assume log(y)|g ~ N(μ_g, σ_g^2), where treatment g=A, B. Then we know

  • E(y|g=A) = exp[μ_a + 0.5σ_a^2].
  • E(y|g=A)/E(y|g=B) = exp[(μ_a - μ_b) + 0.5(σ_a^2 - σ_b^2)]

We can see from the second equation above that the second moment of the distributions matters if there is heteroskedasticity, but not if there is homoskedasticity (i.e.: σ_a=σ_b=σ).
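A quick numerical check shows how much the variance term can matter. In the Python sketch below the log-scale means and standard deviations are hypothetical; both treatments have the same log-scale mean, so any difference in expected cost comes purely from the variances.

```python
import math

def lognormal_mean(mu, sigma):
    """E(y) when log(y) ~ N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical log-scale parameters: identical means for treatments A and B.
mu_a, mu_b = 7.0, 7.0

# Homoskedastic case: same sigma, so the ratio depends only on the means.
ratio_homoskedastic = lognormal_mean(mu_a, 1.0) / lognormal_mean(mu_b, 1.0)
# Heteroskedastic case: A has a larger log-scale variance than B.
ratio_heteroskedastic = lognormal_mean(mu_a, 1.5) / lognormal_mean(mu_b, 1.0)

print("ratio with equal variances:", ratio_homoskedastic)
print("ratio with unequal variances:", round(ratio_heteroskedastic, 3))
```

Comparing only the log-scale means would suggest no difference between A and B in the second case, even though expected costs on the raw scale differ substantially.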

Marginal Effects with log transformation

Calculating marginal effects with non-linear econometric formulations is often difficult.  For instance, we know that E(y)= exp(xβ)E{exp(ε)|x}. This implies that the marginal effect is equal to:

  • dE(y)/d(xk)=exp(xβ)[βkE{exp(ε)|x}+ d E{exp(ε)|x}/d(xk)]

This is much more complicated than the incorrect formulation dE(y)/d(xk)=exp(xβ)βk.

Generalized Linear Model Approach

In this method, one searches for the β’s that solve the following estimating equation:

  • Σ [dμ(xβ)/dβ] * V(x)^(-1) * (y - μ(xβ)) = 0

In practice, one usually assumes that μ(xβ)=exp[xβ]. A variance structure is assumed so that Var(y|x)=α[E(y|x)]^γ. The values of γ correspond to some standard parametric distributions:

  • Gaussian NLS: γ=0
  • Poisson: γ=1
  • Gamma: γ=2
  • Wald or inverse Gaussian: γ=3.

Two Part Models

To this point, we have been focusing on the skewness problem and been ignoring the fact that many of the observations also clump at zero. We can decompose the expected value as follows:

  • E(y|x) = P(y>0)*E{y|y>0} + P(y=0)*0 = P(y>0)*E{y|y>0}

Now we must estimate P(y>0) and E(y|y>0) separately. The first term we can estimate with a probit model [P(y>0)=Φ(xα)]. For the second term, one can log the y term to take skewness into account.

If the log-scale error term is normally distributed, then:

  • yhat= Φ(xα)*exp(xβ + .5σ^2), where β and σ are estimated from the data.

If the log-scale error term is not normally distributed, then one can use the following formulation:

  • yhat= Φ(xα)*exp(xβ)*D
  • D is Duan's (JASA 1983) smearing estimator:
  • D=N^(-1) Σ exp[ε]=N^(-1) Σ exp[ln(y|y>0) - xβols]
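The smearing formula is easy to apply once the two parts have been estimated. In the Python sketch below, the log-scale residuals and the index values xα and xβ are hypothetical placeholders for what a probit and a log-scale OLS regression would produce; the function names are my own.

```python
import math
from statistics import NormalDist

def duan_smearing_factor(log_residuals):
    """D = average of exp(residual) over the observations with y > 0."""
    return sum(math.exp(e) for e in log_residuals) / len(log_residuals)

def two_part_prediction(x_alpha, x_beta, smearing_factor):
    """yhat = Phi(x*alpha) * exp(x*beta) * D."""
    prob_any_use = NormalDist().cdf(x_alpha)   # first part: P(y > 0)
    return prob_any_use * math.exp(x_beta) * smearing_factor

# Hypothetical residuals from a log-scale OLS regression on the positive observations.
residuals = [-1.2, -0.4, 0.1, 0.3, 1.2]
D = duan_smearing_factor(residuals)

# Hypothetical index values for one person.
yhat = two_part_prediction(x_alpha=0.5, x_beta=6.0, smearing_factor=D)
print("smearing factor:", round(D, 3), "predicted spending:", round(yhat, 1))
```

Because exp() is convex, D is above 1 whenever the residuals average to zero and are not all identical, which is why simply exponentiating the log-scale prediction understates expected spending.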

Count Data

Count data in health economics is very common. The number of doctor visits, hospitalizations and ER visits all are types of count data. Poisson and Negative Binomial regressions are frequently recommended for these types of data.

The Nursing Home Compare website provides consumers with quality ratings of thousands of nursing homes (NHs) around the country. Are these ratings accurate? Could they be improved?

This is the question which researchers Arling, Lewis, Kane, Mueller and Flood analyze in their 2007 HSR paper. The authors find 2 major flaws with the rankings: 1) there is weak risk-adjustment and thus the ratings do not fully take into account the underlying characteristics of the population being served by the NH, and 2) there are no precision measures included in the rankings.

In order to improve the rankings, the authors use an empirical Bayesian (EB) shrinkage model with risk adjustment.

In the empirical Bayesian model, an empirical distribution serves as the prior. When new data are collected, they enter through the likelihood, and combining the prior with the likelihood gives the posterior distribution. Confidence intervals are constructed around the EB estimates from the posterior distribution. In this paper, the authors have data at both the resident and facility level. The prior distribution is estimated using the total nursing home resident population. The likelihood is based on facility-level data, and the posterior distribution is proportional to the product of the two. The authors explain in more detail:

“The influence of the facility’s observed QM [quality measure] rate on the posterior estimate will depend on the size of the facility and the amount of QM variation within and between facilities. The QM rates in larger facilities will be more certain (e.g., have lower standard errors) than in smaller facilities and, thus, will have greater weight or influence on the overall posterior (EB) estimate. Also, QMs with less variation between facilities have a more certain empirical prior (population average QM rate), which then has a greater influence on the posterior. As the prior tends to pull the posterior estimate toward the population mean, EB estimates are referred to as ‘shrinkage’ estimates.”

Using the EB methodology, the standard deviations for most QMs “decreased considerably.” Smaller facilities experienced more shrinkage towards the mean due to their small number of residents. This is logical since one outlier patient would have a much higher impact on average QM rankings in a NH with 10 residents than another facility with 100 residents.
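The intuition for why small facilities shrink more can be sketched with a textbook normal-normal shrinkage formula. This is a simplification for illustration, not the authors' actual model, and every number in it (rates, variances, facility sizes) is hypothetical.

```python
def shrunken_rate(observed_rate, n_residents, population_rate, between_var, within_var):
    """Weighted average of a facility's observed rate and the population rate.

    The weight on the facility's own data rises with its size:
    weight = between_var / (between_var + within_var / n_residents)
    """
    sampling_var = within_var / n_residents
    weight = between_var / (between_var + sampling_var)
    return weight * observed_rate + (1 - weight) * population_rate

# Two hypothetical facilities with the same observed QM rate but different sizes.
population_rate = 0.10
small = shrunken_rate(0.30, n_residents=10, population_rate=population_rate,
                      between_var=0.002, within_var=0.09)
large = shrunken_rate(0.30, n_residents=100, population_rate=population_rate,
                      between_var=0.002, within_var=0.09)

print("small facility EB estimate:", round(small, 3))
print("large facility EB estimate:", round(large, 3))
```

Both facilities report the same raw rate, but the 10-resident facility is pulled much closer to the population average than the 100-resident one.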

The risk adjustment is calculated in three ways: 1) simply excluding the sickest patients (i.e.: those who have end-stage diseases or are in a coma), 2) grouping the sample into different risk strata, and 3) using a logistic regression to estimate a risk adjustment factor for each patient. Each of the risk adjustment methods was found to have a strong effect on the rankings.

One problem the authors acknowledge is that using EB and risk adjustment may let some facilities ‘off the hook.’ Small facilities with sicker than average patients may have low QM scores because of an unlucky spate of ill patients or they may truly be poor facilities. Bayesian shrinkage moves their scores closer to the mean, so these facilities’ QM ratings are less responsive to quality improvements or backslides than those of larger facilities.

Bootstrapping

One of the biggest advances in statistical modeling in the last 30 years has been the use of the bootstrap. For those interested in learning about the bootstrap in more detail, a good place to start is an article by UCSD math professor Dimitris N. Politis, which I will summarize here. For more detailed information, one may want to look at An Introduction to the Bootstrap by Efron and Tibshirani.

Set-up

Suppose we have N observations of a random variable X. We can group these as a vector X=(X1,…,XN), where the Xi are iid with distribution F. If we want to estimate a parameter θ(F) from the data, we can use a statistic T(X) as an approximation. If we assume that F is normal, we can use traditional statistics to calculate T(X) as well as the confidence interval around θ(F). If we do not know the distribution F (which a researcher probably does not in reality), then classical statistical theory may be less reliable and a bootstrap methodology may be more robust. Bootstrapping allows the researcher to better estimate F, especially if F is significantly skewed.

The bootstrap procedure creates a new sample by randomly sampling observations from X with replacement until we have a new vector with N observations. We repeat this B times to create our bootstrap data set. Let’s look at an example.

Example

Pretend we have data on how many push-ups I have completed each day over 15 days. I want to estimate the median number of push-ups I do each day. In this sample, N=15 and, since we will create ten bootstrap samples, B=10.

Obs. Data B1 B2 B3 B4 B5 B6 B7 B8 B9 B10
1 22 18 25 29 21 21 22 18 31 24 14
2 18 24 14 25 14 21 19 35 25 21 19
3 14 25 24 31 25 21 21 14 21 30 21
4 35 19 18 26 19 25 19 31 24 24 14
5 22 29 29 31 26 30 26 21 22 19 26
6 24 31 24 31 22 30 19 30 31 26 19
7 26 25 19 22 21 25 25 26 22 30 18
8 29 30 22 14 22 22 19 18 31 35 29
9 19 31 21 14 14 21 14 26 18 22 18
10 31 25 24 35 29 22 19 14 31 26 25
11 30 22 25 22 29 14 19 35 19 22 22
12 19 22 24 19 18 35 29 26 21 19 35
13 22 22 22 24 25 24 30 19 35 25 29
14 21 31 25 22 25 14 14 22 31 19 18
15 25 22 19 30 22 35 24 19 19 31 26
Mean 23.8 25.1 22.3 25 22.1 24 21.3 23.6 25.4 24.9 22.2
Median 22 25 24 25 22 22 19 22 24 24 21

The median of the actual data is 22. But we can also estimate the median using a bootstrap methodology. We first randomly choose one of the data points and make it the first observation of B1 (bootstrap sample number 1); we then resample with replacement and make another data point the second observation of B1, and so on. Because we sample with replacement, data points often repeat. For instance, in B1 the value 31 (observation X10) appears three times. The median varies across the 10 bootstrap samples, but the average value of the median across them is 22.8.

We can also calculate the bootstrap variance (3.36) and standard deviation (1.83). These are calculated according to the formulas:

  • Variance: $B^{-1}\sum_i T(X^*_i)^2 - \left[B^{-1}\sum_i T(X^*_i)\right]^2$
  • S.D. = $(\mathrm{Var})^{1/2}$

Here, T(X*i) is the median of bootstrap sample i. Since there are 10 bootstrap samples, i=1,…,10. To calculate the variance, one averages the squared medians over the 10 bootstrap samples and then subtracts the square of the average median.
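As a check on the arithmetic, here is a short Python sketch that applies the formulas above to the ten medians in the table and then draws a fresh set of bootstrap samples. The function names, the seed and the choice of 1,000 resamples are my own assumptions for illustration.

```python
import random
import statistics

data = [22, 18, 14, 35, 22, 24, 26, 29, 19, 31, 30, 19, 22, 21, 25]

def bootstrap_medians(sample, B, rng):
    """Draw B resamples (with replacement, same size as sample); return their medians."""
    n = len(sample)
    return [statistics.median(rng.choices(sample, k=n)) for _ in range(B)]

def bootstrap_summary(medians):
    """Mean, variance and standard deviation using the formulas in the text."""
    B = len(medians)
    mean = sum(medians) / B
    var = sum(m * m for m in medians) / B - mean ** 2
    return mean, var, var ** 0.5

# Medians of the ten bootstrap samples B1..B10 from the table above.
table_medians = [25, 24, 25, 22, 22, 19, 22, 24, 24, 21]
mean, var, sd = bootstrap_summary(table_medians)
print(round(mean, 1), round(var, 2), round(sd, 2))   # 22.8 3.36 1.83

# A fresh run with many more resamples; the seed is arbitrary.
fresh = bootstrap_medians(data, B=1000, rng=random.Random(0))
fresh_mean, fresh_var, fresh_sd = bootstrap_summary(fresh)
```

With only B=10 samples the estimates are noisy; in practice one would use hundreds or thousands of resamples, as in the second call.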

Let us pretend you have a system of M equations, with N observations for each equation. For example, if we are estimating supply and demand independently over 20 years, M=2 and N=20.

If each of the regressors is predetermined in each equation and we have an exclusion restriction, we can use the Seemingly Unrelated Regressions (SUR) methodology to improve the efficiency of the estimates. SUR is simply computing the generalized least squares (GLS) estimate in the multivariate case. A more detailed explanation is given here.

An example is the following:

  • PTS = α0 + α1EXP + α2MIN + u
  • REB = β0 + β1EXP + β2MIN + β3HT + v

Let us assume each basketball player’s points are a function of only their years of experience (EXP), the number of minutes they play per game (MIN) and a constant. The number of rebounds they get per game is also a function of a constant, EXP, and MIN but the person’s height (HT) also affects their rebounding totals. This system of equations would be the same as:

  • PTS = α0 + α1EXP + α2MIN + α3HT + u
  • REB = β0 + β1EXP + β2MIN + β3HT + v

where α3 is constrained to be 0. This constraint is our exclusion restriction. SUR uses a typical instrumental variables approach, but our vector of instruments, z, is equal to the union of the regressors from all equations.

  • z=union of (x1,…xM)

In this example, M=2 so: x1=(1, EXP, MIN)’; x2=(1, EXP, MIN, HT)’; z=(1, EXP, MIN, HT)’. Our orthogonality conditions are that E(zu)=0 and E(zv)=0. Our parameter estimates become:

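  • $\hat{\delta}_{SUR} = \begin{bmatrix} \sigma^{11}A_{11} & \sigma^{12}A_{12} \\ \sigma^{21}A_{21} & \sigma^{22}A_{22} \end{bmatrix}^{-1} \begin{bmatrix} \sigma^{11}c_{11} + \sigma^{12}c_{12} \\ \sigma^{21}c_{21} + \sigma^{22}c_{22} \end{bmatrix}$

where $A_{mh} = n^{-1}\sum_i x_{im}x_{ih}'$, $c_{mh} = n^{-1}\sum_i x_{im}y_{ih}$, and $\sigma^{mh}$ is the (m,h) element of the inverse of the error covariance matrix.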

OLS can also be used because the regressors are predetermined. In fact, if each equation is just identified, SUR is mathematically equivalent to OLS. If at least one equation is overidentified—which would be the case in the first (PTS) equation in our example—then SUR is more efficient than equation-by-equation OLS.

For more information on Seemingly Unrelated Regressions, see Hayashi (2000) Econometrics, pp. 279-283.

One estimation procedure performed by many novice economists is to use OLS to regress quantity on price. Let us assume the following framework (omitting the i subscripts on the variables):

  • qd = α0 + α1p + u
  • qs = β0 + β1p + v
  • qd = qs

If we regress qd on a constant and p in order to estimate the demand equation for some good, the OLS estimate of α1 is given by the formula α1,OLS = Cov(p,q)/Var(p). I solve for Cov(p,q) below:

  • $Cov(p,q) = Cov(p,\ \alpha_0 + \alpha_1 p + u)$
  • $= E(\alpha_0 p + \alpha_1 p^2 + pu) - E(p)\,E[\alpha_0 + \alpha_1 p + u]$
  • $= \alpha_1 Var(p) + Cov(p,u)$ [1]

To find Cov(p,u) we can solve the system of equations above for p and, assuming the demand and supply shocks are uncorrelated (Cov(u,v)=0), take its covariance with u.

  • $p = [(\alpha_0 - \beta_0) + (u - v)]/(\beta_1 - \alpha_1)$
  • $Cov(p,u) = Var(u)/(\beta_1 - \alpha_1)$ [2]

So, substituting [2] into [1], we have:

  • Cov(p,q)= α1Var(p) + Var(u)/(β1 – α1)

Thus, our bias term for the OLS regression is:

  • Cov(p,q)/Var(p) – α1 = Cov(p,u)/Var(p) [3]

Since we see in equation [2] that Cov(p,u) is not equal to 0 unless Var(u) = 0—which is unlikely—we know the OLS estimate is biased. This phenomenon is known as simultaneous equation bias or endogeneity bias. The problem is that the error term (u) is correlated with the independent variable (p). The main way to solve this problem is to use an instrumental variables methodology.
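A small simulation makes the size of the bias concrete. This is only a sketch: the parameter values, the normal shocks and the function name `simulate_ols_slope` are assumptions of mine, and the shocks u and v are drawn independently. With Var(u)=Var(v)=1, α1=-1 and β1=1, equations [2] and [3] imply a bias of Var(u)(β1 - α1)/[Var(u)+Var(v)] = 1, so OLS should return a slope near 0 rather than the true demand slope of -1.

```python
import random

def simulate_ols_slope(n, a0, a1, b0, b1, rng):
    """Simulate market equilibria and regress quantity on price by OLS."""
    prices, quantities = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)   # independent demand and supply shocks
        p = ((a0 - b0) + (u - v)) / (b1 - a1)     # equilibrium price
        q = a0 + a1 * p + u                       # quantity from the demand curve
        prices.append(p)
        quantities.append(q)
    p_bar = sum(prices) / n
    q_bar = sum(quantities) / n
    cov_pq = sum((p - p_bar) * (q - q_bar) for p, q in zip(prices, quantities)) / n
    var_p = sum((p - p_bar) ** 2 for p in prices) / n
    return cov_pq / var_p

# True demand slope is a1 = -1; the seed and sample size are arbitrary.
slope = simulate_ols_slope(20000, a0=10.0, a1=-1.0, b0=2.0, b1=1.0,
                           rng=random.Random(1))
print("OLS slope:", round(slope, 3))
```

The estimated slope mixes the demand and supply curves, which is exactly what the bias term in [3] predicts.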

If you think creating a survey which will compel respondents to answer in an unbiased manner is easy, check out this article originally published in the Wall Street Journal in February (“Census 2010 plays six not-so-easy questions“). The six questions proposed to be asked in 2010 Census short-form questionnaire are as follows:

  1. Name of person
  2. How is this person related to Person 1*? [Person 1 is defined to be the head of household]
  3. What is this person’s sex?
  4. What is this person’s age and what is this person’s date of birth?
  5. Is this person of Hispanic, Latino or Spanish origin?
  6. What is the person’s race?

These seem pretty self-explanatory, right? Well, the questions are not as clear as they seem. Examples of problems from each category are below.

  1. Name: This field can be confusing for migrants. Chinese names are written with the surname first and the given name last (e.g., Yao Ming should be formally addressed as Mr. Yao). Latin-American immigrants typically have two Spanish surnames, one from the father’s family name and one from the mother’s family name.
  2. Relationship: Respondents can choose among 14 possible answers regarding their relationship to the head of household, but a 15th answer–foster child–has been deleted since the 2000 census. How are these poor foster kids going to respond to the 2010 census?
  3. Sex: While this field seems the most self-explanatory, in the 2000 census 0.05% of respondents (or 150,000 of 300 million Americans) checked both the male and female boxes.
  4. Age: According to the WSJ, “Question No. 4 asks age — and for a computer double-check, date of birth — because so many people seem to get it wrong. Adding instructions to ‘report babies as age 0′ when they’re less than a year old, offends some people, census research suggests. But in the 2005 trial it improved the response rate among people who otherwise couldn’t decide how to answer for a six-month old.”
  5. Latino: (see “Race” below)
  6. Race: Again from the WSJ, “But in trial tests, the Census Bureau also found that Asian and Hispanic immigrants could be baffled when asked to lump themselves with other nationality groups. ‘The whole concept of being Latino is a very American construct,’ says Mr. Vargas. ‘People might not know what’s being asked of them.’ Under a 2005 order from Congress, question No. 6 also allows people to call themselves ’some other race’ and identify that race on a fill-in line. In census tests, respondents declared themselves Creole, Aryan, rainbow and cosmopolitan, among others. Other federal data users, like Social Security and the federal Education Department, don’t recognize those races, though. So in data that the Census Bureau will send to those departments, the bureau will impute a race. ‘Maybe I get it right and maybe I get it wrong. It’s not something I like to do,’ says Mr. Waite.”

To sum up, designing a good survey instrument is harder than you think.

Today we will look at some common distributions used for Bayesian inference.

Beta

The first distribution we will look at is the Beta distribution. The beta density is $[B(a,b)]^{-1}\pi^{a-1}(1-\pi)^{b-1}$. We can show that:

  • If the prior ~ $\pi^{a}(1-\pi)^{b}$
  • And likelihood ~ $\pi^{S}(1-\pi)^{F}$
  • Then the posterior ~ $\pi^{a+S}(1-\pi)^{b+F}$

where ‘~’ denotes proportional to (equal up to a constant). If:

  • S* = a + S + 1
  • F* = b + F + 1

Then

  • n* = S* + F*
  • P* = S*/n*

We can use P* and n* just as we would p and n in a classical binomial framework. P* is our posterior mean, and the Bayesian confidence intervals can be calculated as follows.

  • Conf. Int.: $P^* \pm t_{\alpha/2}\sqrt{P^*(1-P^*)/n^*}$
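The beta calculations are easy to script. Below is a sketch in Python; the function name, the example counts and the use of the normal critical value 1.96 in place of $t_{\alpha/2}$ are assumptions for illustration only.

```python
def beta_posterior_summary(a, b, successes, failures, z=1.96):
    """Return P*, n* and an approximate 95% interval for a beta-binomial update.

    The prior is proportional to pi**a * (1 - pi)**b, as in the text.
    z=1.96 is a normal approximation standing in for the t critical value.
    """
    s_star = a + successes + 1
    f_star = b + failures + 1
    n_star = s_star + f_star
    p_star = s_star / n_star
    half_width = z * (p_star * (1 - p_star) / n_star) ** 0.5
    return p_star, n_star, (p_star - half_width, p_star + half_width)

# Hypothetical example: prior with a=1, b=1; data with 7 successes and 3 failures.
p_star, n_star, interval = beta_posterior_summary(a=1, b=1, successes=7, failures=3)
print(round(p_star, 3), n_star, [round(x, 3) for x in interval])
```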

Normal Distribution

With the normal distribution, we will again have a normal prior distribution and a likelihood function. The question is which we should rely on more in creating our posterior distribution: our prior assumptions or the data collected?

Let the variance of the prior be $\sigma_0^2$ and the variance of the sample be $\sigma^2$. Also let $\mu_0$ be the prior mean and $X^*$ be the sample mean. If n is the number of observations in the sample, we can calculate the posterior mean as:

  • Posterior mean = $(n_0\mu_0 + nX^*)/(n_0 + n)$
  • where: $n_0 = \sigma^2/\sigma_0^2$

We can see that the prior mean is more important when the prior’s variance is small relative to the sample variance. On the other hand, the sample mean is given more weight when the sample variance is small relative to the prior variance. We can calculate the posterior standard error as:

  • Posterior S.E. = $\sigma/\sqrt{n_0 + n}$
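Here is the same calculation as a short Python sketch. The function name and the example numbers (a prior mean of 100 with variance 25, and 25 observations averaging 110 with variance 100) are hypothetical.

```python
def normal_posterior(prior_mean, prior_var, sample_mean, sample_var, n):
    """Posterior mean and standard error for a normal prior and normal data."""
    n0 = sample_var / prior_var               # prior expressed as "quasi-observations"
    post_mean = (n0 * prior_mean + n * sample_mean) / (n0 + n)
    post_se = sample_var ** 0.5 / (n0 + n) ** 0.5
    return n0, post_mean, post_se

n0, post_mean, post_se = normal_posterior(prior_mean=100, prior_var=25,
                                          sample_mean=110, sample_var=100, n=25)
print(n0, round(post_mean, 2), round(post_se, 2))
```

In this example the prior counts for n0=4 observations against 25 from the sample, so the posterior mean lands much closer to the sample mean than to the prior mean.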

Now we will give an example of Bayesian inference in a more complicated setting. This example is based on a problem from pp. 588-591 of Introductory Statistics for Business and Economics by Wonnacott and Wonnacott.

Let us assume that there is a consumer electronics company named Banana, inc. Banana sells iPood mp3 players. Banana, inc., however, has quality control problems and some of the truckloads of iPoods are defective. The proportion of defective iPoods in each truckload is distributed as follows:

Prior distribution of π
% Defective (1)   Nbr. of Shipments (2)   % of Shipments (3)
0%                2                       1%
10%               30                      15%
20%               40                      20%
30%               42                      21%
40%               34                      17%
50%               26                      13%
60%               16                      8%
70%               8                       4%
80%               2                       1%
90%               0                       0%
100%              0                       0%
Total             200                     100%

CircuitVillage receives a truckload of iPoods from Banana, inc. They decide to take a random sample of n=5 iPoods out of the truckload in order to get sample evidence on π (the proportion defective in this truckload). CircuitVillage finds that 3 of the 5 iPoods are defective. What is the posterior distribution of π?

We can calculate the likelihood function using the binomial distribution. The binomial probability function is as follows:

  • $P(k \text{ out of } n) = \frac{n!}{k!(n-k)!}\,p^k(1-p)^{n-k}$

We know that k=3 and n=5, and thus we can evaluate the likelihood function at each value p=π.

Calculation to obtain the posterior distribution
π (1)    Likelihood of π (4)    Prior x Likelihood (5)    Posterior (6)
0%       0.000                  0.000                     0.000
10%      0.008                  0.001                     0.008
20%      0.051                  0.010                     0.064
30%      0.132                  0.028                     0.172
40%      0.230                  0.039                     0.243
50%      0.313                  0.041                     0.252
60%      0.346                  0.028                     0.172
70%      0.309                  0.012                     0.077
80%      0.205                  0.002                     0.013
90%      0.073                  0.000                     0.000
100%     0.000                  0.000                     0.000
Total                           0.161                     1.000

Column 4 is found by plugging the value in the first column in for p in the binomial probability function, where k=3 and n=5. Column 5 is found by multiplying column (3) by column (4). To normalize the distribution so that the probabilities sum to 1, we must divide by the sum of column five (0.161), and thus we have the posterior distribution.

We can ask ourselves what the probability is that less than 25% of the iPoods in the shipment are defective. According to our prior, we would believe that 36% (.01 + .15 + .20) of the truckloads contain iPoods where less than 25% of them are defective. After collecting more information and observing that 3 of the 5 iPoods sampled are defective, our posterior distribution now says that it is less likely that the iPood shipment has a low defect rate. In fact, according to our posterior there is only a 7% chance (0 + 0.01 + 0.06) that less than 25% of the iPoods from Banana, inc. are defective.
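The whole table can be reproduced in a few lines of Python. This is a sketch using only the shipment counts given above; the variable names are my own.

```python
from math import comb

pi_values = [i / 10 for i in range(11)]                 # 0%, 10%, ..., 100% defective
shipments = [2, 30, 40, 42, 34, 26, 16, 8, 2, 0, 0]     # column (2)
prior = [s / sum(shipments) for s in shipments]         # column (3)

def binomial_likelihood(k, n, p):
    """Probability of k defectives in a sample of n when the defect rate is p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

likelihood = [binomial_likelihood(3, 5, p) for p in pi_values]   # column (4)
joint = [pr * lk for pr, lk in zip(prior, likelihood)]           # column (5)
total = sum(joint)
posterior = [j / total for j in joint]                           # column (6)

# Probability that fewer than 25% of the iPoods are defective, before and after.
prior_low = sum(pr for p, pr in zip(pi_values, prior) if p < 0.25)
post_low = sum(po for p, po in zip(pi_values, posterior) if p < 0.25)
print(round(total, 3), round(prior_low, 2), round(post_low, 3))  # 0.161 0.36 0.071
```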

Bayesian Inference is an important econometric tool. Over the next few days, we will review some of the basic Bayesian inference methods.

Economicitis occurs in 300 out of every 100,000 adults. Recently, however, a test has been developed to screen for the disease. Of 1000 individuals with economicitis who were tested, only 40 had an erroneous negative test. Of 1000 healthy individuals who were tested, 20 had an erroneous positive test result.

My friend Ron received the sad news that his test result shows that he has economicitis. Ron wants to know, given that the test result is positive, what the actual chance is that he has the disease.

One way to estimate this is using Bayesian inference. According to Bayesian theory:

  • Posterior Odds = prior odds x likelihood ratio
  • Posterior Odds = $\frac{p(\theta_1)}{p(\theta_2)} \times \frac{p(X \mid \theta_1)}{p(X \mid \theta_2)}$

The prior odds of having the disease are 300/99,700. This is equivalent to the prior probability Ron has the disease (300/100,000) divided by the prior probability he does not have the disease (99,700/100,000). The likelihood ratio is equal to the probability of having a positive test given the person has the disease (1-40/1000) divided by the probability of having a positive test given that the person is healthy (20/1000). Thus we have:

  • Posterior Odds = (300/99,700) x [(960/1000)/(20/1000)] = .144

This means that the odds that an individual who tests positive for economicitis actually has the disease are about 1 to 7. To calculate the posterior probability, simply use the following formula:

  • Posterior Probability = (posterior odds)/(1 + posterior odds)=.144/1.144=12.6%

Thus, Ron should not be too worried about having the disease. Using the prior, Ron only had a 0.3% chance of having the disease, but even after having tested positive for economicitis, Ron still only has a 12.6% chance of being stricken by this deadly disease.
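Ron's calculation can be checked with a few lines of Python. The variable names are mine; the numbers are the ones given above.

```python
prior_odds = 300 / 99_700                   # diseased vs. healthy adults
sensitivity = 1 - 40 / 1000                 # P(positive test | disease)
false_positive_rate = 20 / 1000             # P(positive test | healthy)

likelihood_ratio = sensitivity / false_positive_rate
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)

print(round(posterior_odds, 3), round(posterior_prob, 3))   # 0.144 0.126
```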

Today I will review a few basic concepts of time series econometrics. A time series is a stochastic process whose observations appear in different time periods. For instance, {zi} (i=1,2,3,…) is a stochastic process with zi representing the GDP each quarter. Below are a few definitions which are important to econometric estimation using time series data.

  • Covariance Stationary Processes. A process is covariance stationary if i) E(zi) does not depend on i, and ii) Cov(zi,zi-j) exists, is finite, and depends only on j but not on i.
  • White Noise. A covariance stationary process, {zi}, is white noise if E(zi)=0 and Cov(zi,zi-j)=0 for j≠0. We typically assume that the error term in most estimating equations is white noise.
  • Ergodicity. A stationary process is said to be ergodic if:
    • $\lim_{n\to\infty} |E[f(z_i,\ldots,z_{i+k})\,g(z_{i+n},\ldots,z_{i+n+l})]| = |E[f(z_i,\ldots,z_{i+k})]|\,|E[g(z_{i+n},\ldots,z_{i+n+l})]|$
    • This means that as the observations from the time series become further and further apart, they become independent.
  • Martingale. Let xi be an element of zi. The scalar process {xi} is a martingale with respect to {zi} if:
    • E(xi|zi-1,zi-2,zi-3,…,z1)=xi-1.
    • This means that given all the information from the past, our best guess at the value of x in this time period is the value of x last time period. For instance, Hall’s Martingale Hypothesis states that given a variety of macroeconomic variables, my best guess of aggregate consumption this quarter is equal to aggregate consumption last quarter.
  • Random Walk. A random walk is a specific type of martingale formed as the cumulative sum of a white noise process. Let {gi} be a white noise process. Then a random walk process {zi} would equal the following:
    • z1=g1
    • z2=g1+g2
    • zi=g1+g2+…+gi
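The random walk construction zi=g1+g2+…+gi can be sketched in a few lines of Python. Drawing the white noise from a normal distribution is an assumption made here for convenience (white noise only requires a zero mean and zero autocovariances), and the function name and seed are arbitrary.

```python
import random
from itertools import accumulate

def random_walk(n, rng, sigma=1.0):
    """Return (g, z): white-noise draws g_1..g_n and their running sums z_i."""
    g = [rng.gauss(0.0, sigma) for _ in range(n)]
    z = list(accumulate(g))        # z_i = g_1 + ... + g_i
    return g, z

g, z = random_walk(500, random.Random(42))

# The increments of the walk recover the white noise: z_i - z_{i-1} = g_i.
increments = [z[0]] + [z[i] - z[i - 1] for i in range(1, len(z))]
```

Because each increment has mean zero given the past, the best guess of zi given z1,…,zi-1 is zi-1, which is the martingale property described above.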

For further information, see: Hayashi, Fumio (2000) Econometrics. Princeton University Press. USA.

One of the basic concepts in statistics is the use of mathematically rigorous tests to determine whether or not a researcher can reject their null hypothesis. The null hypothesis is the state of the world the researcher assumes exists. The alternative hypothesis is, as the name suggests, an alternative to the null hypothesis. Through these statistical tests, researchers try to find the truth regarding a certain phenomenon. The degree of certainty the investigator has in his or her conclusion depends on the probability of type I and type II errors. A type I error occurs when the null is incorrectly rejected; a type II error occurs when we fail to reject the null when in fact the alternative is true. Below are more concrete examples of type I and type II errors.

Criminal Justice

In the criminal justice system, defendants are assumed to be innocent until proven guilty. Thus, the null hypothesis is that the individual is innocent while the alternative is that the defendant actually committed the crime. A type I error would occur if the individual were convicted of a crime they didn’t commit. A type II error would occur when a guilty person is set free.

Clinical Drug Trial

In the case of a clinical test for a new pharmaceutical, the null hypothesis would be that a new drug (drug N) is no better than the current drug (drug O). On the other hand, the alternative hypothesis would state that drug N is superior to drug O. A type I error would conclude that drug N is better than drug O when in fact it is not. A type II error would conclude that the new and old pharmaceuticals are equivalent when in fact drug N is superior.
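The two error rates can also be seen in a quick simulation. The sketch below is my own illustration rather than anything from a particular trial: it tests the null that a mean is zero using a two-sided z-test with a known standard deviation of 1, and the effect size (0.2), sample size (25) and seed are arbitrary.

```python
import random

def rejection_rate(true_mean, n, reps, rng, critical=1.96):
    """Share of simulated samples in which H0: mean = 0 is rejected (known sd = 1)."""
    rejections = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * n ** 0.5
        if abs(z) > critical:
            rejections += 1
    return rejections / reps

rng = random.Random(7)
# When H0 is true, every rejection is a type I error.
type_1_rate = rejection_rate(0.0, n=25, reps=4000, rng=rng)
# When H0 is false, every failure to reject is a type II error.
type_2_rate = 1 - rejection_rate(0.2, n=25, reps=4000, rng=rng)
print(round(type_1_rate, 3), round(type_2_rate, 3))
```

The type I rate sits near the chosen 5% significance level, while the type II rate is large here because the true effect is small relative to the sample size.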

Ordinary Least Squares 

If you have studied basic statistics, it’s likely that you have come across the ordinary least squares (OLS) estimation technique.  OLS minimizes the sum of squared distances between the dependent variable (‘y‘) and a linear prediction of y (y_hat=xβ).  The parameter vector ‘β_ols‘ minimizes this distance.  The most important assumption needed for β_ols to reflect the true parameters in the population is that the regressors are uncorrelated with the error terms (cov(x,e)=0).  Sometimes this is not the case.  The assumption fails if:

  1. There are omitted variables which are correlated with the regressors (x)
  2. We have a system of simultaneous equations.
  3. There is an errors in variables problem
  4. The system has a lagged dependent variable with a serially correlated disturbance

Instrumental Variables

One solution to these problems is to use an Instrumental Variables (IV) technique.  (Click here for an explanation of IV).  A question remains as to when OLS is appropriate and when IV is best.  OLS will generally give smaller standard errors (and thus is more precise) and is to be preferred when the β_OLS parameters are unbiased.  

Hausman Endogeneity Test

To test whether the IV or OLS regression technique is best, one can use the Hausman endogeneity test.  Let us try to estimate the following equation:

  • (1) y1 = x1*δ + y2*α + e

Let the vector z=(x1,x2) be the set of all exogenous variables.  The vector x1 is the set of regressors and x2 are our instruments.  Since z is exogenous, we know E(z′*e)=0.  The variable y2, we believe to be endogenous. 

One example of an endogenous y2 would be a wage equation where y1 is the individual’s wage and y2 is the number of hours worked.  We would think that full time workers would earn more than part time workers so hours would affect wage.  On the other hand, when a worker’s wage is higher (assuming leisure is a normal good) one would expect the individual to work more hours.  In this example we have dual causation.

To conduct the Hausman test, we first find the linear projection of y2 on z using OLS.

  • (2) y2 = z*π + v

Since the error term from the first equation (‘e‘) is uncorrelated with z by assumption, then y2 is endogenous if and only if E(v*e)≠0.  We can test whether the structural error ‘e‘ is correlated with the reduced form error (‘v‘) using the following equation:

  • (3) e = ρ*v + u

If we plug equation 3 into equation 1, we have:

  • (4) y1 = x1*δ + y2*α + ρ*v + u

In empirical data, however, ‘v‘ is not observed.  Nevertheless, we can estimate it with ‘v_hat‘, the saved residuals from our OLS regression of equation 2, and plug these numbers into equation 4 in place of ‘v‘.  The final equation is:

  • (5) y1 = x1*δ + y2*α + ρ*(v_hat) + u

We can now consistently estimate δ, α, and ρ using OLS.  Using the usual OLS t-statistic, we can test the null hypothesis that ρ=0.  If we fail to reject the null, then there is no evidence of an endogeneity problem and one should use an OLS estimation strategy.  If we reject the null (ρ≠0), then the instrumental variables technique is best.  One can also use a heteroskedasticity-robust t-statistic for testing ρ if one suspects heteroskedasticity. 

A similar set of procedures can be extended to the case where y2 is a vector.  Instead of a t-test on the coefficient of the residual ‘v_hat‘, in the vector case we would have to perform an F-test (ρ=0) on the coefficients of a vector of residuals ‘v_hat‘.  To see how to perform Hausman tests in the Stata statistical package, look at this paper by Baum, et al.

Summary

  1. Perform a first-stage regression of the endogenous variable (y2) on z.
  2. Calculate the residuals from this equation and include them as an additional regressor in the original estimation equation.
  3. Run OLS on this new equation and perform a t-test for the coefficient on the first-stage residuals.
  4. If one fails to reject the null hypothesis, then there is no evidence of an endogeneity problem and OLS should be used.  If one rejects the null hypothesis, then endogeneity is a problem and one should use an IV estimation strategy.
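The steps above can be sketched in plain Python on simulated data. Everything here is an illustration under assumptions of my own: the data-generating parameters (including a true ρ of 0.8), the seed, and the small `ols` helper, which solves the normal equations directly and reports conventional (not heteroskedasticity-robust) standard errors.

```python
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, standard errors, residuals)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [value / scale for value in aug[c]]
        for r in range(k):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    se = [(s2 * inv[a][a]) ** 0.5 for a in range(k)]
    return beta, se, resid

# Simulated data; all parameter values are arbitrary choices for illustration.
rng = random.Random(3)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]        # included exogenous regressor
x2 = [rng.gauss(0, 1) for _ in range(n)]        # excluded instrument
v = [rng.gauss(0, 1) for _ in range(n)]         # reduced-form error
e = [0.8 * vi + rng.gauss(0, 1) for vi in v]    # structural error, correlated with v
y2 = [1 + 0.5 * a + 1.0 * b + vi for a, b, vi in zip(x1, x2, v)]
y1 = [1 + 1.0 * a + 2.0 * yy + ei for a, yy, ei in zip(x1, y2, e)]

# Step 1: first-stage regression of y2 on z = (1, x1, x2); keep the residuals.
Z = [[1.0, a, b] for a, b in zip(x1, x2)]
_, _, v_hat = ols(Z, y2)

# Steps 2-3: add v_hat to the structural equation and t-test its coefficient.
X = [[1.0, a, yy, vh] for a, yy, vh in zip(x1, y2, v_hat)]
beta, se, _ = ols(X, y1)
t_rho = beta[3] / se[3]
print("rho_hat =", round(beta[3], 3), "t =", round(t_rho, 1))
```

Because y2 is endogenous by construction in this simulation, the t-statistic on the first-stage residuals is far above conventional critical values, so step 4 would point to IV rather than OLS.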

References

Wooldridge, Jeffrey; Econometric Analysis of Cross Section and Panel Data, MIT Press, London, (c) 2002, pp. 118-122.  

What is one to do when the dependent variable under investigation is categorical?  Well, if these categories are ordered, then an ordered probit (or logit) estimation technique is a sensible means of estimation.  An example where ordered probit estimation should be used is an integer index ranking physician quality between one and five.  On the other hand, if the dependent variable is the number of surgeries a patient has, a Poisson estimation methodology would be best since ‘y‘ is a count variable. 

Let us continue with the physician ranking example.  Suppose there are three ranking categories: excellent (2), average (1), and poor (0).  We assume there is a latent variable y* which is a function of a vector of covariates (‘x‘).  The latent variable determines which category the physician falls into.

  • y* = xβ + ε; ε|x ~ N(0,1)
  • y=0 if  y*<α_1
  • y=1 if  α_1 ≤ y* ≤ α_2
  • y=2 if  y*>α_2

Now we can calculate the probabilities that a physician will fall into each category.

  • P(y=0|x)=P(xβ + ε<α_1)=P(ε<α_1-xβ)=Φ(α_1-xβ)
  • P(y=1|x)=P(xβ + ε<α_2) - P(xβ + ε<α_1) = Φ(α_2-xβ)-Φ(α_1-xβ)
  • P(y=2|x)=P(xβ + ε>α_2)=1-Φ(α_2-xβ)

Using maximum likelihood estimation, we can now estimate the α and β parameter vectors.  The log-likelihood function for observation i becomes:

  • l_i(α,β)=1{y_i=0}log[Φ(α_1-x_iβ)] + 1{y_i=1}log[Φ(α_2-x_iβ) - Φ(α_1-x_iβ)] + 1{y_i=2}log[1-Φ(α_2-x_iβ)]

If we instead assume that ε|x follows the logistic distribution, whose cdf is exp(z)/[1+exp(z)], then we have the ordered logit model instead. 

The end statistic of interest is how P(y=j|x) responds to a change in a covariate x_k.  These marginal effects can be calculated as follows:

  • ∂p_0(x)/∂x_k= -β_kφ(α_1-xβ)
  • ∂p_1(x)/∂x_k= β_k[φ(α_1-xβ)-φ(α_2-xβ)]
  • ∂p_2(x)/∂x_k= β_k[φ(α_2-xβ)]
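The category probabilities and marginal effects are simple to compute once the parameters are in hand. The Python sketch below takes the index xβ, the cut points and β_k as given; the function names and numbers are hypothetical, and no estimation is performed.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return exp(-0.5 * z * z) / sqrt(2 * pi)

def ordered_probit_probs(xb, a1, a2):
    """P(y=0|x), P(y=1|x), P(y=2|x) for index xb and cut points a1 < a2."""
    p0 = norm_cdf(a1 - xb)
    p1 = norm_cdf(a2 - xb) - norm_cdf(a1 - xb)
    p2 = 1 - norm_cdf(a2 - xb)
    return p0, p1, p2

def marginal_effects(xb, a1, a2, beta_k):
    """Derivative of each category probability with respect to x_k."""
    return (-beta_k * norm_pdf(a1 - xb),
            beta_k * (norm_pdf(a1 - xb) - norm_pdf(a2 - xb)),
            beta_k * norm_pdf(a2 - xb))

# Hypothetical values: index xb = 0.3, cut points -0.5 and 1.0, beta_k = 0.7.
probs = ordered_probit_probs(0.3, -0.5, 1.0)
effects = marginal_effects(0.3, -0.5, 1.0, 0.7)
print([round(p, 3) for p in probs], [round(m, 3) for m in effects])
```

Note that the three probabilities sum to one, so the three marginal effects must sum to zero: a covariate that raises the chance of an "excellent" ranking has to lower the chance of the other categories.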

For more information on ordered probits, see the Tokyo Climate Center’s ordered probit explanation as well as the treatment in Econometric Analysis of Cross Section and Panel Data (pp. 504-509) by Wooldridge.

Many data sets that social scientists come across use disproportionate stratified sampling. If a subpopulation is small, the survey designers may want to oversample this group. For example, in the Survey of Income and Program Participation (SIPP) poor individuals are oversampled and in the Community Tracking Study (CTS) uninsured individuals are oversampled in order to give more precision to the estimates made for these smaller groups. Below is a brief explanation of how to work with a disproportionate stratified data set.

Simple Example (from a Napier University website)

Let us imagine a town which has 1200 rich people and 2500 poor people. Due to budget constraints, the survey designer samples 100 people from each of the two strata (200 people total). The sampling fraction for the rich is .08333 (100/1200) and for the poor is .04 (100/2500). The weight placed on each observation is simply the inverse of the sampling fraction; thus the weights are 12 for the rich and 25 for the poor.

In the example above, suppose the mean household income in the poor areas was £12,000 and that in the rich areas was £25,000; then the weighted mean would be

[100 x £12,000 x (w=25) + 100 x £25,000 x (w=12)] ÷ (100 x 25 + 100 x 12) = £16,216.22.

An unweighted mean here would just be £18,500, so we can see that the weighting has corrected for the fact that the sample has too many rich households.
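The same weighting can be written as a short Python sketch. The dictionary layout and function names are my own; the figures are those from the example.

```python
strata = {
    "poor": {"population": 2500, "sampled": 100, "mean_income": 12_000},
    "rich": {"population": 1200, "sampled": 100, "mean_income": 25_000},
}

def weighted_mean(strata):
    """Weight each sampled unit by the inverse of its stratum's sampling fraction."""
    numerator = denominator = 0.0
    for s in strata.values():
        weight = s["population"] / s["sampled"]          # 25 for poor, 12 for rich
        numerator += s["sampled"] * weight * s["mean_income"]
        denominator += s["sampled"] * weight
    return numerator / denominator

def unweighted_mean(strata):
    """Treat every sampled unit equally, ignoring the sample design."""
    total = sum(s["sampled"] * s["mean_income"] for s in strata.values())
    return total / sum(s["sampled"] for s in strata.values())

print(round(weighted_mean(strata), 2), unweighted_mean(strata))   # 16216.22 18500.0
```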

Econometrics (see Wooldridge pp. 590-598)

Here is how Wooldridge explains variable probability sampling:

  1. Draw an observation w_i at random from the population
  2. If w_i is in stratum j, toss a (biased) coin with probability ‘p_j‘ of turning up heads. Let h_{ij}=1 if the coin turns up heads and zero otherwise.
  3. Keep observation i if h_{ij}=1; otherwise leave out of the sample.

A weighted M-estimator would be:

  • min _{β} SUM_{i=1 to N} [p_{j_i}]^{-1}*q(w_i,β)

Here q(w,β) is the objective function that is chosen to identify the population parameters β_o. In the OLS case, q(w_i,β)=(y_i-x_iβ)^2. The asymptotic variance matrix for the linear model is:

  • [SUM_i p^{-1}x'x]^{-1} [SUM_i p^{-2}(u^2)x'x] [SUM_i p^{-1}x'x]^{-1}

where all variables are to have subscript i’s except p, since p=p_{j_i}.

Stata

A simple and clear example of how to use weights in a stratified sample can be found at the UCLA Academic Technology Services website (“Stata FAQ: How do I use the Stata survey (svy) commands?”). There are three main variables which need to be defined.

  1. The primary sampling unit (psu) is the lowest unit of observation, usually either an individual or a household identification number. In the econometric section above, psu’s are indexed by the letter i.
  2. The strata are the groups into which the data set is divided. The strata are indexed by the letter j. In the first example above, there are two strata: rich households and poor households.
  3. The sampling weight is defined as the inverse of the psu’s probability of selection.

To program this into Stata, we would write:

  • svyset [pweigh=wt], psu(house) strata(eth)

Here the psu is the variable house, the strata are categorized by eth (a variable for the ethnic group) and the weight is the variable wt. To run a weighted least squares regression (WLS), you would simply type:

  • svy: regress y x1 x2 x3

and the appropriate weighting will occur.

The Poisson distribution is one that is often used in health economics.  Wikipedia has a nice basic summary of the Poisson distribution; Wolfram MathWorld gives a more sophisticated analysis.  The distribution is

$f(k;\lambda)=\frac{e^{-\lambda}\lambda^k}{k!}$

where ‘λ‘ is equal to the number of expected occurrences in a period.  The distribution expresses the probability of a number of events (‘k‘) occurring in a fixed period of time if these events occur with a known average rate and are independent of the time since the last event.  The variance and the mean of a Poisson distribution are the same.  Healthcare economists can use the distribution to determine how different variables (e.g., income, smoking, medical treatments) affect the probability of observing a certain number of events (e.g., illnesses or deaths). 

Let’s look at an example:

In the US in 2000, there were approximately 15 million people aged 55-59.  The death rate for this age group was approximately 750 deaths per year per 100,000 individuals.  Thus, the expected number of deaths (‘λ‘) per year is 112,500.  What factors impact the lambda term?  An economist may be interested in whether increasing average real income for the cohort would increase or decrease the death rate.  We can model lambda as a function of covariates (‘x‘), such as whether or not the individual has health insurance, whether they are a smoker, and their income level, together with their corresponding coefficients (‘β‘).  Thus our new equation is:

  • f(k;x)=exp[-λ(x;β)] * [λ(x;β)]^k / k!

If we have observations from multiple census years (and if we assume that the 55-59 age cohort is of the same size each year), we can estimate these coefficients (β) using a log likelihood function:

  • l_i(β)=k_i *log [λ(x_i;β)] - [λ(x_i;β)]

The ‘k!‘ term drops out because it does not depend on the parameter β. Each observation (‘i‘) corresponds to data from one census year in the sample.  The most common form for λ(x;β) to take is λ(x;β)=exp(xβ).  If the variable x_j is continuous and we assume λ(x;β)=exp(xβ), then we can show:

  • ∂{E(k|x)} / ∂{x_j} = exp(xβ)*β_j
  • β_j = ∂{log [E(k|x)]} / ∂{x_j}

Now that we know β_j, the economist knows the impact that any covariate ‘x‘ (such as income, smoking, or health insurance) will have on the average death rate (λ).
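The pieces of the Poisson model can be sketched in Python as follows. The covariates and coefficient values are invented for illustration, the function names are my own, and no estimation routine is included; the block only evaluates the formulas from the text.

```python
from math import exp, factorial, log

def poisson_pmf(k, lam):
    """f(k; lambda) = exp(-lambda) * lambda**k / k!"""
    return exp(-lam) * lam ** k / factorial(k)

def expected_count(x, beta):
    """lambda(x; beta) = exp(x*beta) for covariate vector x."""
    return exp(sum(xj * bj for xj, bj in zip(x, beta)))

def log_likelihood(ks, xs, beta):
    """Sum over observations of k_i*log(lambda_i) - lambda_i (the k! term is dropped)."""
    total = 0.0
    for k, x in zip(ks, xs):
        lam = expected_count(x, beta)
        total += k * log(lam) - lam
    return total

def marginal_effect(x, beta, j):
    """dE(k|x)/dx_j = exp(x*beta) * beta_j."""
    return expected_count(x, beta) * beta[j]

# Hypothetical covariates: constant, insured (0/1), log income.
x = [1.0, 1.0, 10.5]
beta = [2.0, -0.10, -0.05]
lam = expected_count(x, beta)
print(round(lam, 3), round(marginal_effect(x, beta, 2), 3))
```

Since β_j is the derivative of log E(k|x), a coefficient of -0.05 on log income in this made-up example would mean a one-unit increase in log income lowers the expected count by roughly 5%.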

Jason has insurance and his brother Nosaj does not.  Jason utilizes more medical services than Nosaj.  Is this situation occurring because Jason is truly sicker than Nosaj (adverse selection), or is it because, since Jason has insurance, medical services are cheaper for him than for Nosaj (moral hazard)?  Disentangling the problems of moral hazard and adverse selection is what Susan Ettner sets out to do in her 1997 JHE paper “Adverse selection and the purchase of Medigap insurance by the elderly.” 

Ettner motivates her paper by dividing individuals into four groups:

  • Group A: High propensity to use services; Employer does not offer Medigap coverage
  • Group B: Low propensity to use services; Employer does not offer Medigap coverage
  • Group C: High propensity to use services; Employer does offer Medigap coverage
  • Group D: Low propensity to use services; Employer does offer Medigap coverage

Ettner assumes that groups A, C and D will purchase Medigap and group B will not.  If individuals choose employment based on criteria apart from the quality of the firm’s health plan offerings, then groups C and D combined will accurately represent the population as a whole.  We can calculate the amount of adverse selection by taking the difference between the average utilization of group A versus groups C and D combined.  Moral hazard can be calculated by comparing the average utilization of group B versus group D; however, this is not possible since empirically it will be impossible to separate out groups C and D.  Ettner says that comparing group B versus groups C and D combined will lead to an overestimate of moral hazard, but this will still be less biased than an estimator comparing the uninsured (B) with the insured (A, C, and D).
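The two comparisons can be written out as simple differences in group means. The sketch below uses made-up average utilization figures; the numbers and function names are hypothetical and are not taken from Ettner's paper.

```python
# Hypothetical average annual utilization (e.g., physician visits) by group.
# Groups C and D cannot be separated empirically, so they share one mean.
mean_use = {"A": 9.0, "B": 4.0, "CD": 7.0}

def adverse_selection(mean_use):
    # Group A selects into Medigap on its own; C and D stand in for the
    # population as a whole.
    return mean_use["A"] - mean_use["CD"]

def moral_hazard_upper_bound(mean_use):
    # C and D include high-propensity users while B does not, so this
    # difference overstates the pure moral hazard effect.
    return mean_use["CD"] - mean_use["B"]

print(adverse_selection(mean_use), moral_hazard_upper_bound(mean_use))
```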

Data and Methodology

Ettner uses data from the 1991 Medicare Current Beneficiary Survey (MCBS) and runs a multinomial logit regression comparing individuals with employer Medigap, individual Medigap and Medicaid only policies.  In addition to usual demographic and socio-economic variables, Ettner uses state-level variables such as SSI income standard, a cost-of-living index and the price of the most comprehensive Medigap policy in the state. One problem is that individuals with a high propensity to consume medical services may elect not to choose Medicaid coverage since they know that if they become sick, they can ex post sign up for Medicaid and be covered.

The author proceeds to estimate moral hazard and selection effects on resource use.  The expected value of using a service ‘Y’ can be written:

  • E(Y)=P(Y>0)*E(Y|Y>0)

The probability term is estimated using a probit model and the conditional expected value term is estimated using OLS.  In this part of the paper, Ettner enriches the analysis by subdividing the regressions into basic Medigap and enhanced Medigap (e.g., those with nursing care and/or prescription drug coverage). 
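The decomposition E(Y)=P(Y>0)*E(Y|Y>0) can be checked on a small sample. The sketch below uses raw sample shares and means instead of the probit and OLS regressions with covariates that the paper uses; the spending figures are hypothetical.

```python
def two_part_mean(spending):
    """Return P(Y>0), E(Y|Y>0) and their product, using sample analogues."""
    positive = [y for y in spending if y > 0]
    p_any_use = len(positive) / len(spending)
    mean_if_use = sum(positive) / len(positive) if positive else 0.0
    return p_any_use, mean_if_use, p_any_use * mean_if_use

# Hypothetical annual spending on a service for ten individuals.
spending = [0, 0, 0, 120, 0, 450, 80, 0, 0, 350]
p_any_use, mean_if_use, expected = two_part_mean(spending)
print(p_any_use, mean_if_use, expected)
```

Without covariates the product simply reproduces the overall sample mean; the two-part structure matters once each part is modeled separately as a function of individual characteristics.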

Results

Ettner finds that overall, adverse selection is not a significant factor in the purchase of Medigap insurance.  There is some evidence of adverse selection (those with cardiovascular or musculoskeletal problems are more likely to purchase Medigap), but there is also evidence of favorable selection (individuals who are smokers or who rate their health to be poorer are less likely to buy Medigap policies). 

Regarding the utilization results, Ettner finds that the enhanced Medigap comparisons showed stronger moral hazard effects than those with basic Medigap policies.  If adverse selection is not controlled for, the paper demonstrates that the moral hazard estimates are biased upwards.

Ettner, Susan (1997); “Adverse selection and the purchase of Medigap insurance by the elderly,” Journal of Health Economics, Vol 16, pp. 543-562.

How much would you be willing to pay for a cancer treatment with a 2% chance of working?  How much would you be willing to spend for a new vaccine that was as effective as a prior vaccine, but was now available in chewable tablets?  One way to answer questions regarding new products or goods where markets don’t exist is to use the contingent valuation method (CVM).  In general, CVM is a survey which presents individuals with a hypothetical situation and asks them how much they would be willing to pay (WTP) for the good in question.  CVM was originally used to value changes in environmental laws as well as new urban transportation systems.  One of the individuals at the forefront of this field is Richard Carson, an economist at UC-San Diego. 

In a 1999 Health Policy article, Thomas Klose analyzes CVM in the medical field.  Below are different CVM methodologies:

  • Open-Ended: Here a person is asked a general question of how much they would be willing to pay for the good.  In practice, however, these questions often lead to very inaccurate responses.
  • Take-it or Leave-it (TIOLI):  In this method, an individual is presented with an option of paying $X for the good.  The benefit here is that the survey participant is faced with a concrete decision.  On the negative side, it often takes a large sample size to achieve accurate estimates.
  • Bidding Games: Using a computer, a survey participant will be asked a question as in the TIOLI situation.  If they answer ‘yes’ to a WTP of $X, subsequent questions will ask if the individual would be willing to pay $Y for the good where Y>X.  If the answer to the TIOLI question in the first stage is ‘no’, then subsequent questions will ask if the individual would pay $Z for the good where Z<X.  Starting point bias (the amount $X may be seen as a ‘reasonable’ amount by the individual and they may offer an inaccurate WTP) can cause problems.
  • Check box: This is similar to an open ended question, but individuals get to check a box which corresponds to the maximum they would be willing to pay for the good.  This method has the problem of range bias, where respondents tend to pick values in the middle of the range of boxes.
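As a rough illustration of the bidding game described in the list above, the sketch below runs a single follow-up round and returns the bounds it implies for the respondent's WTP. The function name, the bid amounts and the simulated respondent are all hypothetical.

```python
def bidding_game(would_pay, start_bid, raise_to, lower_to):
    """One follow-up round; returns (lower, upper) bounds on WTP.

    would_pay is a function bid -> bool standing in for the respondent.
    An upper bound of None means the respondent accepted the highest bid.
    """
    if would_pay(start_bid):
        if would_pay(raise_to):
            return (raise_to, None)
        return (start_bid, raise_to)
    if would_pay(lower_to):
        return (lower_to, start_bid)
    return (0, lower_to)

# Hypothetical respondent whose true WTP is $70.
def respondent(bid):
    return bid <= 70

print(bidding_game(respondent, start_bid=50, raise_to=100, lower_to=25))
print(bidding_game(respondent, start_bid=100, raise_to=200, lower_to=50))
```

Both calls bracket the same true WTP, but the interval a survey actually recovers depends on the starting bid, which is one way to see why starting point bias is a concern.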

There are two other important biases to note which I did not mention above.  The first is question order bias.  Some people will offer a high WTP when questions are in a certain order, but when the order changes their WTP may also change.  There may also be suggestive elements in the wording of the question which cause a problem.  Second, response effects need to be taken into account.  Response effects occur if an individual answers a question strategically.  For instance, someone with cancer may answer ‘$10m’ to the question ‘How much would you be willing to pay for a cancer treatment with a 2% chance of working?’, because they believe this would influence drug companies to produce the treatment.

Another issue to take into account is the payment method.  Is the person paying for the treatment through a co-pay, through increases in insurance premiums in order to cover the treatment, or through increased taxes?  Also, what is the timing of the payment?  Is it ex-ante (before they get a disease), intermediate (in the diagnosis stage) or ex-post (after they already have the disease)? 

CVM is a very useful tool, but one that must be used with caution.  Careful design is imperative in order to be able to reach valid conclusions from a CVM survey. 

  • Klose (1999); “The contingent valuation method in health care” Health Policy, Vol 49, pp. 97-112.

Other articles of interest regarding CVM in the medical field:

  • Propper, Carol (1990); “Contingent Valuation of time spent on NHS waiting lists,” The Economic Journal, Vol 100(400), pp. 193-199.
  • Diener, O’Brien, Gafni (1998); “Health care contingent valuation studies: a review and classification of the literature,” Health Economics, Vol 7, pp. 313-326.

Throughout the past week, I have spoken of the work disincentives many social security programs create.  The question is: how do we measure these disincentives?  The economics literature offers three different metrics, each based on the implicit social security wealth a retiree has, and I will discuss each in turn.

Accrual

The accrual method measures how much the value of social security benefits would increase (or decrease) if a person decides to postpone retirement by one year at age ‘t‘.

  • Accrual=SSW_{t+1} – SSW_{t}

SSW_{t} is one’s implicit social security wealth if they retire at age ‘t’.  This is calculated by taking the nominal benefits at each future date and multiplying them by the probability of surviving to that date as well as a discount factor. 

Take the example below:

 

Age   Benefit   P(Surv)   Discount factor   Value
64    1000      1.00      1.00              1000.00
65    1000      0.75      0.94               705.00
66    1000      0.50      0.89               445.00
67    1000      0.25      0.84               210.00
68    1000      0.00      0.79                 0.00
Total                                       2360.00

Here, a pension for someone who retires at age 64 is $1000 per year.  I assume that everyone dies at age 68.  Each value is the benefit multiplied by the survival probability and the discount factor, so the total value of the individual’s SSW is $2360.  What if the person decided to postpone retirement to age 65?

Age   Benefit   P(Surv)   Discount factor   Value
64    0         1.00      1.00                 0.00
65    2000      0.75      0.94              1410.00
66    2000      0.50      0.89               890.00
67    2000      0.25      0.84               420.00
68    2000      0.00      0.79                 0.00
Total                                       2720.00

In my example, the individual receives $2000 per year if they retire at age 65 (instead of 64).  After taking into account the probability of surviving to each age as well as the discount factor, the person’s new SSW is $2720.  Thus the accrual amount is $360.  By postponing retirement, the individual increases their Social Security Wealth.  In many systems, this amount can be a large negative amount which gives individuals an incentive to retire early.
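The calculation can be reproduced in a few lines. The sketch below computes SSW as the sum of benefit times survival probability times discount factor at each age, using the survival probabilities and the rounded discount factors from the tables; the function and variable names are my own.

```python
def social_security_wealth(benefits, survival, discount):
    """Sum over ages of benefit * P(survive to that age) * discount factor."""
    return sum(b * p * d for b, p, d in zip(benefits, survival, discount))

# Ages 64 through 68, as in the example tables.
survival = [1.0, 0.75, 0.5, 0.25, 0.0]
discount = [1.00, 0.94, 0.89, 0.84, 0.79]

benefits_retire_64 = [1000, 1000, 1000, 1000, 1000]
benefits_retire_65 = [0, 2000, 2000, 2000, 2000]

ssw_64 = social_security_wealth(benefits_retire_64, survival, discount)
ssw_65 = social_security_wealth(benefits_retire_65, survival, discount)
accrual = ssw_65 - ssw_64

print(round(ssw_64, 2), round(ssw_65, 2), round(accrual, 2))
```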

Peak Value

The peak value calculation is similar to the accrual method. 

  • Peak value=SSW_{r*} – SSW_{t}

This method takes the SSW at the age of retirement (‘r*‘) where r* is the age of retirement which maximizes the value of social security benefits and subtracts the SSW which would result from retirement this year.  This method may be optimal to capture the fact that many people retire at a target year and don’t calculate their own incentives each year.
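A minimal sketch of the peak value calculation follows. The SSW figures by retirement age are hypothetical, and restricting r* to the current age or later is an assumption on my part.

```python
# Hypothetical social security wealth by retirement age.
ssw_by_age = {62: 2100.0, 63: 2250.0, 64: 2360.0, 65: 2720.0, 66: 2650.0, 67: 2400.0}

def peak_value(ssw_by_age, current_age):
    """Return (r*, SSW at r* minus SSW at the current age)."""
    # Assumption: only retirement ages from the current age onward are considered.
    candidates = {age: w for age, w in ssw_by_age.items() if age >= current_age}
    r_star = max(candidates, key=candidates.get)
    return r_star, candidates[r_star] - ssw_by_age[current_age]

print(peak_value(ssw_by_age, 63))  # looks ahead to the peak at age 65
print(peak_value(ssw_by_age, 66))  # past the peak, so nothing is gained by waiting
```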

Option Value

The option value is the most sophisticated method.  It takes into account utility from not working, but requires the researcher to model a utility function and assume parameter values.  First, let us calculate the value of retirement (V) at date ‘s’.  This is equal to the discounted expected utility of wages earned between the current date ‘t’ and date ‘s-1’ plus the discounted expected utility of social security benefits earned between date ‘s’ and the end of life.

The option value is equal to: V(r*)-V(t).  This is the difference between the discounted expected utility from retiring at r* (the date that maximizes utility) and the discounted expected utility from retirement today (date ‘t’).  If the option value is positive, the individual will continue to work.  If the option value is negative, the individual will retire. 
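The sketch below illustrates the option value logic under strong simplifying assumptions: a hypothetical power utility function, a hypothetical leisure weight that scales up retirement income, certain survival to a fixed last age, and made-up wages and benefits. It is not the specification estimated by Stock and Wise.

```python
def utility(income, gamma=0.75):
    # Hypothetical power utility; gamma is an assumed curvature parameter.
    return income ** gamma

def value_of_retiring_at(s, t, last_age, wage, benefit_by_retirement_age,
                         delta=0.94, leisure_weight=1.5):
    """V(s): discounted utility of wages from t to s-1, then benefits from s on."""
    working = sum(delta ** (a - t) * utility(wage) for a in range(t, s))
    retired_income = leisure_weight * benefit_by_retirement_age[s]
    retired = sum(delta ** (a - t) * utility(retired_income)
                  for a in range(s, last_age + 1))
    return working + retired

def option_value(t, last_age, wage, benefit_by_retirement_age):
    """Return (r*, V(r*) - V(t)), with r* chosen among retirement ages after t."""
    def v(s):
        return value_of_retiring_at(s, t, last_age, wage, benefit_by_retirement_age)
    later_ages = [s for s in benefit_by_retirement_age if s > t]
    r_star = max(later_ages, key=v)
    return r_star, v(r_star) - v(t)

# Hypothetical annual benefit as a function of the age at retirement.
benefits = {64: 8000.0, 65: 9000.0, 66: 10000.0}
r_star, ov = option_value(t=64, last_age=70, wage=30000.0,
                          benefit_by_retirement_age=benefits)
keep_working = ov > 0
print(r_star, round(ov, 2), keep_working)
```

With these numbers, wages are well above benefits and benefits rise with the retirement age, so the option value is positive and the individual keeps working.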

A more comprehensive treatment is given in Stock and Wise (1990).

Stock and Wise (1990) “Pensions, the Option Value of Work and Retirement”, Econometrica, Vol. 58(5), pp. 1151-1180

Under the Balanced Budget Act of 1997, the Federal government established the State Children’s Health Insurance Program (SCHIP), which was aimed at reducing the number of uninsured children in the United States. States were given a variety of options of how to implement this program. Nineteen states decided to operate the SCHIP program as an extension of Medicaid (M-SCHIP), fifteen states operated stand-alone programs (S-SCHIP) and 17 states used both approaches.

Researchers often use a variety of regression methods to test for the impact of government programs on various variables. A common approach in this case is to use an ‘eligibility’ variable to test for a change in insurance coverage. However, a Rosenbach, Ellwood, Czajka, Irvin, Coupe and Quinn (2001) paper gives one pause as to the effectiveness of such a simple approach.

For instance, Minnesota’s M-SCHIP program extended insurance benefits to fewer than 100 individuals. New York’s S-SCHIP program had an enrollment of over 500,000 children. What accounts for these differences?

Prior to Title XXI–the SCHIP legislation–Minnesota already had a generous public insurance benefit for children under their Medicaid system. Children at or below 275% of the poverty line were eligible for Medicaid insurance prior to the national legislation. After the legislation, a child had to be at or below 280% of the federal poverty line–an insignificant change.

New York also had a children’s insurance benefit (CHPlus) before Title XXI came into effect. CHPlus granted insurance to children below a less generous threshold, ranging from 100% to 192% of the federal poverty line. The state, however, rolled over all CHPlus participants (over 170,000 individuals) into their new S-SCHIP program.

Thus, it is imperative that a researcher not assume that pre-SCHIP benefit levels in each state are comparable, or else one will reach erroneous conclusions.

Programs such as Medicaid and Medicare aim to expand health insurance to those currently uninsured.  These programs certainly accomplish this goal, but they also crowd out private insurance.  This means that an individual who has private health insurance may decide to use public insurance instead.  This would mean that society is simply substituting individual payment of insurance for government payment of insurance.

Cutler and Gruber (1996) provide the seminal work on this subject.  They examine Medicaid expansions in the late 1980s and early 1990s.  During this period there were large increases in Medicaid eligibility, specifically for pregnant women and children in (relatively) higher income families.  Cutler and Gruber find that an increase in Medicaid insurance coverage for 100 children will result in a decrease in private insurance for 31 of these children (31% crowdout); for adults, increasing coverage for 100 people will reduce the number of people with private insurance by about 49 individuals (49% crowdout).  One would guess that these effects are so large because they estimate the impact of Medicaid expansions, and thus many of these potential participants would have already had insurance.  Crowdout is presumably much lower for those at the lowest end of the income distribution.

Estimation

Using the 1988-1993 March Current Population Survey (CPS), Cutler and Gruber aim to estimate the following equation:

  • COV=B_1*Elig + B_2*X + a_s*state + a_t*time+e
    • ‘COV‘ is a dummy for Public, Private, or No insurance; ‘Elig‘ is a dummy for eligibility; ‘X‘ is a vector of demographic variables; ‘state‘ and ‘time‘ are dummy variables for specific states and years.

One problem, however, is that ‘Elig‘ is an endogenous variable; people may choose how many hours to work in order to be able to participate in the Medicaid program.  In order to account for this, Cutler and Gruber use an instrumental variable approach.  They create a new variable ‘SimElig‘ by selecting a national random sample of 300 children of each age in each year and 3000 women of child-bearing age in each year.  They then assign the same sample to each state in that year and compute the average eligibility for the group in each state.  Thus, ‘SimElig‘ will vary only through differences in state legislation over time and not by the composition of Medicaid participants in each state.
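The idea behind ‘SimElig‘ can be sketched as follows: hold one national sample fixed, apply each state-year's eligibility rule to it, and record the share that would be eligible. The incomes, state labels and thresholds below are hypothetical, and the real rules depend on more than a single income cutoff.

```python
# Hypothetical fixed national sample: family income as a % of the poverty line.
national_sample = [50, 90, 120, 150, 180, 220, 260, 300]

# Hypothetical eligibility thresholds (% of poverty line) by state and year.
thresholds = {
    ("State A", 1988): 185,
    ("State A", 1992): 275,
    ("State B", 1988): 100,
    ("State B", 1992): 133,
}

def simulated_eligibility(sample, threshold):
    """Share of the fixed national sample eligible under a given threshold."""
    return sum(1 for income in sample if income <= threshold) / len(sample)

sim_elig = {key: simulated_eligibility(national_sample, cutoff)
            for key, cutoff in thresholds.items()}

for key, share in sim_elig.items():
    print(key, share)
```

Because the same sample is used everywhere, differences in the resulting shares come only from the rules themselves, not from who happens to live in each state.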

Health Outcomes

One important question is whether or not having Medicaid insurance improves health outcomes.  While this article does not tackle this issue explicitly, it cites other studies that have.  Currie and Gruber (1995) find that Medicaid eligibility increases for children were associated with increases in medical care utilization and health improvements.  On the other hand, Piper, Ray and Griffin (1990) find no health benefits from Medicaid expansions in Tennessee, and Newhouse, et al. (1993) conclude that there are no significant health differences resulting from more or less generous insurance.

Policy Suggestions

Cutler and Gruber suggest the following policies:

  1. A sliding scale subsidy for the purchase of insurance.  People would receive a voucher for purchase of insurance which they could use either towards payment for private health insurance or they could simply use it towards a contribution towards a Medicaid program.  The subsidy would gradually decline as income increased.
  2. A waiting period could be imposed between when a person loses private insurance and when they become eligible for Medicaid.  This would discourage individuals from quickly switching to free Medicaid insurance at the expense of private insurance.  Many states have recently adopted this approach, especially for the State Children’s Health Insurance Program (SCHIP).
  3. They do not support directly subsidizing hospitals or medical providers for the care of the poor or creating a national health insurance scheme.
Source: Cutler, David; Gruber, Jonathan (1996); “Does Public Insurance Crowd Out Private Insurance,” Quarterly Journal of Economics, Vol 111, No 2, pp. 391-430.

Difference in Difference (DD) is a commonly used empirical estimation technique in economics. Let us take a hypothetical example where a state (Wisconsin) passes a bill which makes employer-provided health insurance tax deductible. Let us also assume that in the year after the bill passed (year 2) the percentage of firms offering health insurance increased by 50% compared to the year before the bill was passed (year 1). In order to estimate the impact of the bill on the percentage of firms offering health insurance, we could simply do a ‘before and after’ analysis and conclude that the bill increased insurance offerings by 50%. The problem is that there could be a trend over time for more employers to offer insurance. With a before and after comparison alone, it is impossible to identify whether the tax deductibility or the time trend caused this increase in firm offering.

One way to identify the impact of the bill is to run a DD regression. If there is a state (California) that did not change the way it treated employer provided health insurance, we could use this as a control group to compare the changes between Wisconsin and California between the two years.

We will run the regression:

Y=β_0 + β_1*T + β_2*WI + β_3*(T*WI) + e

Y is the percentage of firms offering health insurance in each state in each time period. T is a time dummy, WI is a state dummy for Wisconsin, and T*WI is the interaction of the time dummy and the Wisconsin state dummy.

The chart below displays the percentage of firms offering insurance in each state and time period.

          California   Wisconsin
Year 1    a            b
Year 2    c            d

The next chart explains what each coefficient in the regression represents.

Coefficient   Calculation
β_0           a
β_1           c-a
β_2           b-a
β_3           (d-b)-(c-a)

We can see that β_0 is the baseline average, β_1 represents the time trend in the control group, β_2 represents the differences between the two states in year 1, and β_3 represents the difference in the changes over time. Assuming that both states have the same health insurance trends over time, we have now controlled for a possible national time trend. We can now identify what the true impact of the tax deductibility is on employers offering insurance.
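With only four cell means, the regression fits the data exactly, so the coefficients can be computed directly from a, b, c and d. The sketch below does this for hypothetical percentages in which Wisconsin's offer rate rises by 50% (from 40 to 60) while California's rises from 60 to 66.

```python
def dd_coefficients(a, b, c, d):
    """Coefficients of Y = b0 + b1*T + b2*WI + b3*(T*WI) from the four cell means."""
    return {
        "beta_0": a,                  # California, year 1
        "beta_1": c - a,              # time trend in the control state
        "beta_2": b - a,              # Wisconsin vs. California in year 1
        "beta_3": (d - b) - (c - a),  # difference in difference
    }

def predict(coefs, t, wi):
    return (coefs["beta_0"] + coefs["beta_1"] * t
            + coefs["beta_2"] * wi + coefs["beta_3"] * t * wi)

# Hypothetical percentages of firms offering insurance.
a, b, c, d = 60.0, 40.0, 66.0, 60.0
coefs = dd_coefficients(a, b, c, d)
print(coefs)
```

Here the naive before and after comparison gives a 20 point increase for Wisconsin, while the DD estimate β_3 nets out the 6 point increase seen in the control state.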

This is a summary of an article by Robert Moffitt (1986):

Often in Public Economics, we come across budget constraints which are piecewise-linear. Some examples are studies of: the negative income tax, the Social Security program, the food stamp program, Aid to Families with Dependent Children (AFDC), and unemployment insurance. Wrinkles to the traditional piecewise-linear budget constraint are: Halpern and Hausman (1985), who incorporate uncertainty of benefit receipt of disability insurance; Moffitt (1983), who allowed welfare recipients to shop among different kinks in their budget constraint; Venti and Wise (1984), who allowed there to be fixed costs to moving; and Hausman (1981), who allowed there to be fixed costs (transportation and/or child care) involved in the decision to work or not. With regard to health insurance, workers who work ‘too many’ hours may see that their marginal tax rate (and thus the implicit health care subsidy they receive from group insurance) will change. This will lead to a piecewise-linear budget constraint.

Model

I will use the tax subsidy to employer-provided health insurance to explain Moffitt’s model. Consumers maximize utility U(X,Y) subject to M=PX+Y if X<=X*, or m=pX+Y if X>X*. Here m=M+(p-P)X* is the adjusted income on the second segment; it accounts for the change in the budget constraint that occurs when employees work ‘too many’ hours and move into a higher tax bracket. The function g() is the regular demand function in the standard linear setting. The demand function in the piecewise setting can be written as follows:

  • X=g(P,M) if X<X*
  • X=X* if X=X*
  • X=g(p,m) if X>X*

Which segment will we be on? That depends on the indirect utility function. If V(P,M)>V(p,m) then we will choose the first segment and if V(P,M)<V(p,m) we will choose the second segment.

Writing the Demand function more concisely, we have:

X=D1*g(P,M)+D2*g(p,M+(p-P)X*) +(1-D1-D2)X*

  • D1=1 if X*-g(P,M)>0, 0 otherwise
  • D2=1 if g(p,M+(p-P)X*)-X*>0, 0 otherwise.
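The concise demand function can be sketched in code. The linear form of g() and its parameter values below are hypothetical and chosen only to show how the indicators D1 and D2 select a segment or the kink.

```python
def g(price, income, a=10.0, b=2.0, c=0.01):
    # Hypothetical linear 'regular' demand function.
    return a - b * price + c * income

def piecewise_demand(P, p, M, x_star):
    """X = D1*g(P,M) + D2*g(p,m) + (1-D1-D2)*X*, with m = M + (p-P)*X*."""
    m = M + (p - P) * x_star
    d1 = 1 if x_star - g(P, M) > 0 else 0   # demand falls on the first segment
    d2 = 1 if g(p, m) - x_star > 0 else 0   # demand falls on the second segment
    return d1 * g(P, M) + d2 * g(p, m) + (1 - d1 - d2) * x_star

# Same prices and income, three different kink points X*.
print(piecewise_demand(P=1.0, p=2.0, M=500.0, x_star=20.0))  # first segment
print(piecewise_demand(P=1.0, p=2.0, M=500.0, x_star=12.0))  # at the kink
print(piecewise_demand(P=1.0, p=2.0, M=500.0, x_star=5.0))   # second segment
```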

Comparative Statics:

  • dX/dP=D1*g_1(P,M)-D2*g_2(p,m)X* <= 0

    • If the price of health insurance increases (or the tax subsidy decreases) you will demand less health insurance

  • dX/dp=D2*[g_1(p,m)+g_2(p,m)X*] <= 0

    • If one moves to the high marginal tax area and the price of health insurance increases, demand for insurance will decrease.

  • dX/dM=D1*g_2(P,M)+D2*g_2(p,m) >=0

    • Health Insurance is assumed to be a normal good, so as income increases, the demand for health insurance increases.

Here g_1() is the derivative of g with respect to price; g_2() is the derivative of the demand function with respect to income. We can see that the income effects are non-negative on demand for X and the price effects are non-positive for X. These are simply derivatives however and a non-trivial change in price or income may cause an individual to switch from one segment of the budget constraint to another.

Incorporating Heterogeneous Preferences

One would guess that not all consumers have the same preferences for health care. Some may have a low or high risk of illness, others may be more or less risk averse. Moffitt modifies the demand function to incorporate this so that the new ‘regular’ demand function is: g(P,M;B,A)=g(P,M;B)+A.

Maximum Likelihood Estimation

From the above equations, consumers will choose to be on the first area only if A<X*-g(P,M;B), and on the second area only if A>X*-g(p,m;B). I assume that no one chooses to locate exactly on the kink between the two areas. If we assume that there is a normally distributed error term, we can create the following likelihood function.

L = Π_i Pr(X_i)

  • Π_i signifies that we should multiply the probabilities of all the observations.

  • Pr(X)=f{A+e=X-g(P,M;B) | A<X*-g(P,M;B)} on area one, or f{A+e=X-g(p,m;B) | A>X*-g(p,m;B)} on area two

    • This is the probability we observe X, given the person chose X on area one (or area two).

  • General distributional assumptions are that e~N[0,(s_e)^2], and A~N[a,(s_a)^2] where a is the mean of A which can be estimated using an observed set of covariates.

Now that we know the distribution of this function, we simply use numerical methods to estimate the desired parameters, and we will have estimates of the demand function for a piecewise-linear budget constraint.

Moffitt, Robert (1986); “The Econometrics of Piecewise-Linear Budget Constraints: A Survey and Exposition of the Maximum Likelihood Method,” Journal of Business & Economic Statistics, Vol. 4, No. 3, pp. 317-328.