Econometrics


How do you estimate the specific effect smoking has on the probability of being hospitalized?  If smokers on average have lower income and less educational achievement, is smoking truly causing the increase in hospitalization, or could these covariates fully or partially explain the increased hospitalization rates?

A paper by Kleinman and Norton suggests using adjusted risk ratios with logistic regressions.  The formula for this procedure is as follows:

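  • $\mathrm{ARR} = \left[N^{-1}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if exposed})\right] \div \left[N^{-1}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if unexposed})\right]$   (1)
  • $\mathrm{ARD} = \left[N^{-1}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if exposed})\right] - \left[N^{-1}\sum_{i=1}^{N} \mathrm{risk}_i(X_i \mid \text{as if unexposed})\right]$   (2)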

The authors explain the first equation as follows:

  • “The denominator of equation (1) is the mean of this calculated risk for each observation when the exposure variable is assumed to be unexposed and represents an MLE of the unexposed (baseline) risk for a population whose covariates are distributed as for the observed covariates for the entire study population. The numerator in equation (1) represents an MLE of the adjusted risk among the exposed. This approach is a specific example of using what are called “recycled predictions.”

Standard errors can be calculated using either bootstrapping or the Delta Method.  However, the authors recommend bootstrapping the standard errors since, in their account, it reduces the computational resources needed and also allows for asymmetric confidence intervals.
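
The recycled-predictions calculation in equations (1) and (2) can be sketched in a few lines. This is only an illustration: the logistic coefficients, the covariate names and the five people below are hypothetical values made up for the example, not estimates from Kleinman and Norton or from any real data. In practice the coefficients would come from a fitted logistic regression.

```python
import math

# Hypothetical logistic regression coefficients (assumed, not estimated here).
coef = {"intercept": -2.0, "smoker": 0.7, "low_income": 0.5, "years_education": -0.05}

# Hypothetical study population; only the covariates are needed for recycling.
people = [
    {"low_income": 1, "years_education": 10},
    {"low_income": 0, "years_education": 16},
    {"low_income": 1, "years_education": 12},
    {"low_income": 0, "years_education": 12},
    {"low_income": 0, "years_education": 18},
]

def predicted_risk(person, smoker):
    """Predicted hospitalization risk with the exposure set to `smoker` (0 or 1)."""
    z = (coef["intercept"]
         + coef["smoker"] * smoker
         + coef["low_income"] * person["low_income"]
         + coef["years_education"] * person["years_education"])
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Every person is scored twice: once as if exposed, once as if unexposed.
risk_exposed = mean(predicted_risk(p, 1) for p in people)
risk_unexposed = mean(predicted_risk(p, 0) for p in people)

arr = risk_exposed / risk_unexposed   # equation (1)
ard = risk_exposed - risk_unexposed   # equation (2)
print(f"ARR = {arr:.3f}, ARD = {ard:.3f}")
```

Because both averages are taken over the same covariate distribution, the comparison is not confounded by differences in income or education between smokers and non-smokers, at least for the covariates included in the model.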


Let us say you have 10 observations of 2 different variables.  How do you determine which of the observations to use?  Should you throw out the outliers?  Should you only include the most similar values?  Do more observations increase or decrease the amount of measurement error?

These questions can be answered by the discipline of statistics.  An interesting book by Stigler recounts The History of Statistics.  Astronomers led many of the statistical advances in the seventeenth and eighteenth centuries.  Accurate measurement is very important to astronomers.  Further, observations with respect to the circumference and oblateness of the earth were made at different times and places throughout history.  This leaves the conundrum of how best to combine these observations.

Mayer, Boscovich, and others contributed to the development of the idea of least squares, but Stigler credits Legendre with the invention of least squares.  Legendre came up with the idea in his attempt to measure the length of the meridian quadrant (the distance from the equator to the North Pole) through Paris.

To demonstrate some of his ideas, I will use a simpler example.  Let us assume that a drug can have a dosage level between 0 and 5 and we want to find its impact on health (measured on a 0-10 scale).  Let us look at the following data.  The goal is to find the parameters m (slope) and b (intercept) that accurately measure the relationship between drug dosage and health (ignore any questions of endogeneity).  Should we include all 10 observations?

Although Euler recognized that including more observations increases the maximum possible error, Legendre realized that adding more observations also greatly increased the probability of getting close to the true value of the parameters of interest.  

In my example, we need to fit a line to estimate the parameters m and b.  How do we treat the errors so that we have the most accurate estimates?  Laplace believed that the following two conditions would need to hold:

  1. $\sum_i \mathrm{Dosage}_i \, e_i = 0$
  2. $\sum_i |\mathrm{Dosage}_i \, e_i| = \text{minimum}$

The first condition basically says that the errors are uncorrelated with the independent variables on average.  The second condition seeks to make the total size of the errors as small as possible.  Legendre extended Laplace’s second condition to minimize the sum of the squared errors rather than just the absolute error level.

Another key point is that this regression line must go through the “center of gravity.”  In my example, the average dosage for the ten observations is 2.2 and the average health level is 5.9.  This means the center of gravity is at the coordinates (2.2, 5.9).  The solution in my example is m=1.1456 and b=3.3797.  We see that if we plug 2.2 into the equation (1.1456*2.2 + 3.3797), the output is 5.9; thus, the regression line does indeed go through the center of gravity.
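
The two properties discussed above (errors uncorrelated with dosage, and a line through the center of gravity) can be checked numerically. The sketch below uses the textbook closed-form least squares solution for one regressor. The ten dosage and health values are hypothetical numbers invented for illustration; they are not the observations from my example, so the fitted m and b will differ from the values quoted above.

```python
# Hypothetical data, made up for illustration (NOT the ten observations in the post).
dosage = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
health = [3, 5, 4, 6, 5, 7, 6, 8, 8, 9]

n = len(dosage)
mean_x = sum(dosage) / n
mean_y = sum(health) / n

# Closed-form least squares estimates for a single regressor.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, health))
sxx = sum((x - mean_x) ** 2 for x in dosage)
m = sxy / sxx              # slope
b = mean_y - m * mean_x    # intercept

# Errors (residuals) of the fitted line.
errors = [y - (m * x + b) for x, y in zip(dosage, health)]

sum_errors = sum(errors)                                        # should be ~0
sum_weighted_errors = sum(x * e for x, e in zip(dosage, errors))  # should be ~0
fitted_at_mean = m * mean_x + b                                 # should equal mean_y

print(f"m = {m:.4f}, b = {b:.4f}")
print(f"line at mean dosage = {fitted_at_mean:.4f}, mean health = {mean_y:.4f}")
```

Up to floating-point rounding, the dosage-weighted errors sum to zero and the fitted line evaluated at the mean dosage returns the mean health level.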

Understanding the historical development of modern statistical techniques is an interesting task, and Stigler’s book enlightens the reader with much detail.


ANOVA

Let us say that you are a hospital administrator.  You are very clever and have come up with a system to score the quality of the work done by the physicians at your hospital.  To simplify things, let’s assume that only 3 physicians work at your hospital.  The physicians’ scores are as follows:

  • Dr. Albert: 76, 85, 91, 67, 73 
  • Dr. Burns: 92, 90, 60, 79, 75
  • Dr. Collin: 50, 80, 83, 80, 74

The average score for Dr. Albert is 78.4, for Dr. Burns is 79.2 and for Dr. Collin is 73.4.  As the hospital administrator, you want to know whether these differences reflect differences in doctor quality or are likely due to random chance.  If there were only two doctors, a t-test would suffice, but what test can you use in the case of multiple doctors?

The solution to this is to run an ANOVA test.  How do we do this?  Follow these easy steps.

  1. Let j be the group (j=a, b, c) and i be the observation number within each group (i=1, 2,…,5).
  2. Calculate the mean of each group (μj): μa = 78.4; μb = 79.2; μc = 73.4.
  3. Also calculate the mean of the entire sample: μ = 77.
  4. Now calculate the sum of squares within each group [$SS_{within} = \sum_j \sum_i (X_{ij} - \mu_j)^2$].  This shows how much variation there is for each doctor.
    • $SS_a = (76 - 78.4)^2 + (85 - 78.4)^2 + (91 - 78.4)^2 + (67 - 78.4)^2 + (73 - 78.4)^2 = 367.2$
    • $SS_b = 666.8$
    • $SS_c = 727.2$
    • $SS_{within} = SS_a + SS_b + SS_c = 1761.2$
  5. Now calculate the sum of squares between groups [$SS_{between} = \sum_j n_j (\mu_j - \mu)^2$].  This shows how much variation there is across the doctors’ average scores.
    • $SS_{between} = 5(78.4 - 77)^2 + 5(79.2 - 77)^2 + 5(73.4 - 77)^2 = 98.8$
  6. The F-statistic is the ratio of the between-group mean square (MS) to the within-group mean square.  How do we go from the SS to the MS?  That’s easy: we divide each SS by its degrees of freedom.
    • $MS_{within} = SS_{within}/(N-J) = SS_{within}/12$.  This is because there are 15 observations and 3 doctors, so 15-3=12.  Our answer here is: 1761.2/12 = 146.77.
    • $MS_{between} = SS_{between}/(J-1) = SS_{between}/2$.  This is because there are 3 doctors, so 3-1=2.  Our answer here is: 98.8/2 = 49.4.
  7. Now we can calculate the F-statistic as: $F = MS_{between}/MS_{within} = 49.4/146.77 = .337$
  8. If we look this up in an F-distribution table with (2, 12) degrees of freedom, we see that the p-value is .721.  That is, if all 3 doctors were equally good, we would observe an F-statistic at least this large about 72% of the time.  Thus, we fail to reject the null that all three doctors are equally good.
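
The manual steps above can be reproduced with a short script. This is a minimal sketch using only the scores listed earlier. The function names are my own, and the p-value uses a closed-form expression for the F distribution that holds only when the numerator degrees of freedom equal 2 (as they do with 3 doctors); for other designs you would need a statistics library or an F table.

```python
scores = {
    "Albert": [76, 85, 91, 67, 73],
    "Burns": [92, 90, 60, 79, 75],
    "Collin": [50, 80, 83, 80, 74],
}

def one_way_anova(groups):
    """Return (SS_within, SS_between, df_between, df_within, F) for a dict of groups."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_within = 0.0
    ss_between = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((x - group_mean) ** 2 for x in values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_within, ss_between, df_between, df_within, f_stat

def p_value_two_numerator_df(f_stat, df_within):
    """Upper-tail probability of F(2, df_within); valid only for numerator df = 2."""
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)

ss_within, ss_between, df_between, df_within, f_stat = one_way_anova(scores)
p_value = p_value_two_numerator_df(f_stat, df_within)

print(f"SS_within = {ss_within:.1f}, SS_between = {ss_between:.1f}")
print(f"F({df_between}, {df_within}) = {f_stat:.3f}, p = {p_value:.3f}")
```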

STATA

Is there an easier way to do this?  Yes.  If you have Stata, you could just use the score as the dependent variable and have dummy variables for Drs. A, B, and C. Then you can run a statistical test that the coefficient estimate for Dr. A = the coefficient estimate for Dr. B = the coefficient estimate for Dr. C.  This will give you the same p-value for the null hypothesis that the three doctors are equally skilled that we calculated manually above.


Nobel laureate James Heckman has a nice summary of how applied econometricians and policy researchers should define causality. Some of the more interesting points I have excerpted below.

On the source of randomness in a sample

One reason why many statistical models are incomplete is that they do not specify the sources of randomness generating variability among agents, i.e., they do not specify why otherwise observationally identical people make different choices and have different outcomes given the same choice. They do not distinguish what is in the agent’s information set from what is in the observing statistician’s information set, although the distinction is fundamental in justifying the properties of any estimator for solving selection and evaluation problems. They do not distinguish uncertainty from the point of view of the agent whose behavior is being analyzed from variability as analyzed by the observing analyst. They are also incomplete because they are recursive. They do not allow for simultaneity in choices of outcomes of treatment that are at the heart of game theory and models of social interactions and contagion (see, e.g., Brock & Durlauf, 2001; Tamer, 2003).

Unbundling a treatment

Researchers often say that a policy change will cause a change in some outcome measure. However, a policy change is often made up of many components. Which components of the policy change actually influenced the outcomes? In Heckman’s words:

Many causal models in statistics are black-box devices designed to investigate the impact of “treatments”—often complex packages of interventions—on observed outcomes in a given environment. Unbundling the components of complex treatments is rarely done. Explicit scientific models go into the black box to explore the mechanism(s) producing the effects.

Outcomes vs. Utilities

Most researchers pick an outcome variable of interest and conclude that if the outcome increases (assuming a beneficial outcome measure), then people are better off. This may not be the case, however. For instance, Bill Clinton’s welfare reform act (PRWORA) may have increased employment rates and income for single mothers, but the mothers’ utility may have decreased. The single mothers may (or may not) have valued spending time caring for their children more than working.

Problems with non-linearity

Issues such as “social interactions, contagion and general equilibrium effects” can complicate causal inference.

What are you measuring?

Let us assume that Y is the outcome variable of interest. Y depends on what state, s, you are in. For instance, in a treatment/no treatment world, Y(s) is the outcome if you were treated and Y(s’) is the outcome if you were not treated. D(s)=1 if you were actually treated and D(s)=0 if you did not receive treatment in the data. Thus, we can measure various things:

  • Average Treatment Effect (ATE): $E[Y(s) - Y(s')]$. This is equal to the average effect if all individuals moved from an untreated to a treated state.
  • Treatment on the Treated (TT): $E[Y(s) - Y(s') \mid D(s) = 1]$. This looks at the average effect of treatment only on those who were treated. This is important if only certain individuals select into the treatment group, or if the policy change is only relevant for certain individuals.
  • Treatment on the Untreated (TUT): $E[Y(s) - Y(s') \mid D(s) = 0]$. This is the average effect treatment would have had on those who were not treated, which matters when deciding whether to extend a program to current nonparticipants. It is distinct from spillovers, where treatment changes the outcomes of people who are not treated. For instance, instituting a work training program for treated individuals may reduce community college enrollment and thus may affect untreated individuals (e.g., if the community college closes from lack of enrollment).
  • Policy relevant treatment effect (PRTE): $E_p[Y(s)] - E_{p'}[Y(s)]$. This parameter compares the average outcomes under two different policy choices, p and p’.
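
To see how ATE, TT and TUT can differ, here is a small simulation sketch. Everything in it is an assumption made for illustration: the outcome distributions, the rule that people with larger gains are more likely to take treatment, and the sample size. In real data only one of Y(s) and Y(s’) is observed per person, so these averages cannot be computed directly; the simulation simply makes both potential outcomes visible.

```python
import random

random.seed(0)

# Simulated population: each person has both potential outcomes (hypothetical values).
people = []
for _ in range(10000):
    y_untreated = random.gauss(10.0, 2.0)       # Y(s')
    gain = random.gauss(2.0, 1.0)               # individual treatment effect
    y_treated = y_untreated + gain              # Y(s)
    # Assumed selection rule: people with larger gains are more likely to be treated.
    treated = gain + random.gauss(0.0, 1.0) > 2.0
    people.append((y_treated, y_untreated, treated))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

ate = mean(y1 - y0 for y1, y0, d in people)
tt = mean(y1 - y0 for y1, y0, d in people if d)
tut = mean(y1 - y0 for y1, y0, d in people if not d)
share_treated = mean(1.0 if d else 0.0 for _, _, d in people)

print(f"ATE = {ate:.3f}, TT = {tt:.3f}, TUT = {tut:.3f}")
print(f"share treated = {share_treated:.3f}")
```

With selection on gains, TT exceeds ATE, which exceeds TUT, and ATE is the weighted average of TT and TUT with weights equal to the treated and untreated shares.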

Heckman, James (2008) “Econometric Causality” NBER WP #13934.


Much of health care data is characterized by a large cluster of observations at 0 and a right-skewed distribution of the remaining outcomes. For instance, people who do not get sick generally use $0 of medical care. Those who do get sick use a varying amount of medical care dollars, but there are a large number of outliers with extremely expensive medical care. How do health economists take these features into account?

David Madden looks at two alternatives to correct for the shape of the distribution in his 2008 JHE paper: sample selection and two-part models. Zero consumption can result from two different decisions: a participation decision and a consumption decision. For instance, in the case of smoking, individuals may decide not to smoke no matter how cheap cigarettes get (participation decision). On the other hand, some smokers may decide not to smoke during a given time period because cigarettes are very expensive or they have low income (consumption decision). Since people cannot smoke negative cigarettes, there still may be a cluster of observations at zero.

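Assume that an individual’s utility from participation is equal to w=α’Z + v. If w>0, then d=1 (the individual participates), and if w≤0, then d=0 (the individual does not participate). For consumption, individuals will choose y**=max[0,y*], where y*= β’X + u. A general model can be written as follows, where $\prod_0$ denotes the product over observations with zero consumption and $\prod_+$ the product over observations with positive consumption: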

  • $L_0 = \prod_0 \left[1 - P(v > -\alpha'Z)\, P(u > -\beta'X \mid v > -\alpha'Z)\right] \prod_+ P(v > -\alpha'Z)\, P(u > -\beta'X \mid v > -\alpha'Z)\, g(y \mid v > -\alpha'Z, u > -\beta'X)$

If u and v are independent, then we have the Cragg model:

  • $L_1 = \prod_0 \left[1 - P(v > -\alpha'Z)\, P(u > -\beta'X)\right] \prod_+ P(v > -\alpha'Z)\, P(u > -\beta'X)\, g(y \mid u > -\beta'X)$

If we assume that the participation constraint dominates the consumption constraint (which is likely in the smoking example, but maybe not for drinking), then we have P(y*>0|d=1)=1 and g(y*|y*>0,d=1)=g(y*|d=1). This means that if you are a smoker you will have at least one cigarette per period. When the participation constraint dominates, we ignore the consumption decision and we have the following likelihood function which corresponds to the Heckman Selection model.

  • $L_2 = \prod_0 \left[1 - P(v > -\alpha'Z)\right] \prod_+ P(v > -\alpha'Z)\, g(y \mid v > -\alpha'Z)$

If independence of u and v is also assumed, then we are left with a probit for participation and OLS for consumption. This is the two-part model:

  • $L_3 = \prod_0 \left[1 - P(v > -\alpha'Z)\right] \prod_+ P(v > -\alpha'Z)\, g(y)$
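
The logic of the two-part model is that expected consumption factors into a participation probability times expected consumption among participants. The sketch below illustrates only that decomposition, using a single binary covariate and simple group proportions and means in place of the probit and OLS regressions described above. The records are hypothetical numbers made up for the example.

```python
# Hypothetical records: (group flag, consumption). Made up for illustration only.
records = [
    (0, 0.0), (0, 0.0), (0, 120.0), (0, 0.0), (0, 300.0),
    (1, 0.0), (1, 450.0), (1, 900.0), (1, 0.0), (1, 5000.0),
]

def two_part(records, group):
    """Return (P(y>0), E[y | y>0], implied E[y]) for one covariate group."""
    values = [y for g, y in records if g == group]
    positives = [y for y in values if y > 0]
    p_positive = len(positives) / len(values)          # part 1: participation
    mean_if_positive = sum(positives) / len(positives)  # part 2: level given participation
    return p_positive, mean_if_positive, p_positive * mean_if_positive

for group in (0, 1):
    p, m, expected = two_part(records, group)
    print(f"group {group}: P(y>0) = {p:.2f}, E[y|y>0] = {m:.2f}, E[y] = {expected:.2f}")
```

In this saturated case the implied expected value equals the raw group mean. With continuous covariates, each part would be estimated by its own regression and the two predictions multiplied in the same way.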

Which of these models works best empirically?

Results

Madden looks at the fit of regressions modeling smoking and drinking behavior using a wide variety of covariates. In general, the two-part model seems to perform better in the data used for this study, but the author wisely notes that deciding between the Heckman selection model and the two-part model should be done on a case-by-case basis.


Let us assume that there are two types of people: smart people and dumb people. Smart people’s test scores are normally distributed around a mean of 80%, and dumb people’s test scores are normally distributed around a mean of 40%. If we observe the test score of one person, how do we know if they are smart or dumb? If we see a score of 85%, we are pretty sure they are smart. A dumb person might have had a good day, but this would be a low probability event. Similarly, if we saw a score of 35%, we would be fairly certain that the person is dumb, even though there is a small probability that a smart person may have had a bad day. If we see a score of 62%, however, then it is very difficult to tell whether the person is smart or dumb. But how can we quantify the probability that a person is of a certain type?

One way of doing this is finite mixture models. Jim Hamilton’s Time Series Analysis book has a good explanation of this topic and I will review this material here.

Each type (e.g., how smart the person is) will be designated as $s_t = 1, 2, \ldots, N$. Let us assume that there is an observed variable $y_t$ (e.g., the test score) which, for a person of type $s_t = j$, is distributed $N(\mu_j, \sigma_j^2)$. What the researcher wants to know is: given that we observe $y_t$, what is the probability that the observation is from a person of type $s_t = j$?

Let us assume that we know the density of $y_t$ conditional on the type:

  • $f(y_t \mid s_t = j; \theta) = (2\pi\sigma_j^2)^{-1/2} \exp\{-(y_t - \mu_j)^2 / (2\sigma_j^2)\}$

There is also some underlying distribution of types.

  • $P(s_t = j; \theta) = \lambda_j$
  • $\theta = (\mu_1, \ldots, \mu_N, \sigma_1, \ldots, \sigma_N, \lambda_1, \ldots, \lambda_N)$

From the definition of conditional probability, we know that:

  • P(A and B)=P(A|B)*P(B), which implies
  • $f(y_t, s_t = j; \theta) = \lambda_j (2\pi\sigma_j^2)^{-1/2} \exp\{-(y_t - \mu_j)^2 / (2\sigma_j^2)\}$

The unconditional density can be found as follows:

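  • $f(y_t; \theta) = \sum_{j=1}^{N} f(y_t, s_t = j; \theta)$
  • $f(y_t; \theta) = \lambda_1 (2\pi\sigma_1^2)^{-1/2} \exp\{-(y_t - \mu_1)^2 / (2\sigma_1^2)\} + \ldots + \lambda_N (2\pi\sigma_N^2)^{-1/2} \exp\{-(y_t - \mu_N)^2 / (2\sigma_N^2)\}$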

Now we can use maximum likelihood estimation techniques to find the θ which will maximize:

  • $\max_\theta L(\theta) = \sum_{t=1}^{T} \log f(y_t; \theta)$
  • s.t.: $\lambda_1 + \lambda_2 + \ldots + \lambda_N = 1$
  • s.t.: $\lambda_j \geq 0$

Once we have the maximum likelihood estimate of θ, we can figure out the probability that observation yt came from a person of type st=j. Using Bayes’ rule, we know that:

  • $P(s_t = j \mid y_t; \theta) = f(y_t, s_t = j; \theta) / f(y_t; \theta) = \lambda_j f(y_t \mid s_t = j; \theta) / f(y_t; \theta)$

This value represents the probability, given the observed data, that the unobserved type responsible for observation t was type j. For example, “…if an observation yt=0, one could be virtually certain that the observation had come from a N(0,1) distribution rather than a N(4,1) distribution, so that P(st=1|yt;θ) for that date would be near unity. If instead yt were around 2.3, it is equally likely that the observation might have come from either regime so that P(st=1|yt;θ) for such an observation would be close to 0.5.”
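
The posterior type probability can be computed directly once θ is known. The sketch below applies it to the test score example from the start of this post. The means of 80 and 40 come from that example, but the standard deviation of 10 for each type and the equal type shares of 0.5 are assumptions I have added so the calculation can run. In practice, θ would be estimated by maximum likelihood first.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def posterior_type_probs(y, mus, sigmas, lambdas):
    """P(s=j | y; theta) for each type j: the joint density divided by the mixture density."""
    joint = [lam * normal_pdf(y, mu, sigma) for mu, sigma, lam in zip(mus, sigmas, lambdas)]
    total = sum(joint)  # unconditional (mixture) density f(y; theta)
    return [value / total for value in joint]

mus = [80.0, 40.0]      # type 1 = smart, type 2 = dumb (means from the example)
sigmas = [10.0, 10.0]   # assumed standard deviations
lambdas = [0.5, 0.5]    # assumed population shares of each type

for score in (85, 62, 35):
    p_smart, p_dumb = posterior_type_probs(score, mus, sigmas, lambdas)
    print(f"score {score}: P(smart) = {p_smart:.4f}, P(dumb) = {p_dumb:.4f}")
```

Under these assumed parameters, a score of 85 is assigned to the smart type with probability close to one, a score of 35 to the dumb type with probability close to one, and scores near the midpoint of 60 are close to a coin flip.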

Most of the above content is from:

  • James D. Hamilton (1994) Time Series Analysis, Princeton University Press, Princeton, NJ; pp. 685-689.


What is the effect of a country’s GDP on health? What about the effect of the country’s literacy rate on infant mortality rates? Often researchers try to answer these questions using time-series data. With time series data, we have observations of a few units (e.g., countries or individuals) over many years.

Let the subscript i represent the individual or country and the subscript t indicate the year. We can have a regression framework as follows:

  • $y_{it} = \beta x_{it} + \varepsilon_{it}$

As long as $\mathrm{cov}(x_{it}, \varepsilon_{it}) = 0$, ordinary least squares (OLS) will provide an unbiased estimate of β.

One frequent problem which occurs with time series data is that there will be serial correlation. Serial correlation (or autocorrelation) occurs when the error terms are correlated over time. For instance,

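  • $\varepsilon_{it} = \rho\,\varepsilon_{i,t-1} + \nu_{it}$, where $\nu_{it}$ is a serially uncorrelated innovation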

Serial correlation means that if your predicted y value is overestimated in one period, it is likely to be overestimated in another period. This is likely due to some persistent variable omitted from the regression. For instance, if we regressed test scores on a vector of explanatory variables, it is likely that students who scored higher than their predicted test score in one period would also score higher than their predicted test score in another period.

Fortunately, our coefficient vector (β) is still unbiased even in the presence of serial correlation. However, OLS is no longer efficient, and the conventional standard errors are biased. With positive serial correlation, they are typically too small.

One way to test for serial correlation is to use the Durbin-Watson test. Let $u_{it}$ be the residuals from an OLS regression ($u_{it} = y_{it} - \hat{\beta}_{OLS}\, x_{it}$).

The Durbin Watson statistic is:

  • $d = \left[\sum_{t=2}^{T} (u_{it} - u_{i,t-1})^2\right] / \left[\sum_{t=1}^{T} u_{it}^2\right]$

With panel data we have:

  • $d = \left[\sum_{i=1}^{N}\sum_{t=2}^{T} (u_{it} - u_{i,t-1})^2\right] / \left[\sum_{i=1}^{N}\sum_{t=1}^{T} u_{it}^2\right]$
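
The panel version of the statistic is easy to compute once you have the OLS residuals for each unit. Here is a minimal sketch; the function name and the toy residual series are my own and are chosen only so that the arithmetic can be checked by hand.

```python
def durbin_watson(residuals_by_unit):
    """Durbin-Watson statistic for a list of residual series, one list per unit i.

    With a single unit this is the ordinary time-series statistic; with several
    units the numerator and denominator are summed across units.
    """
    numerator = 0.0
    denominator = 0.0
    for u in residuals_by_unit:
        numerator += sum((u[t] - u[t - 1]) ** 2 for t in range(1, len(u)))
        denominator += sum(e ** 2 for e in u)
    return numerator / denominator

# Toy residuals: one series that flips sign every period, one that never changes.
alternating = [1.0, -1.0, 1.0, -1.0]
persistent = [1.0, 1.0, 1.0, 1.0]

print(durbin_watson([alternating]))              # strongly negative serial correlation
print(durbin_watson([persistent]))               # strongly positive serial correlation
print(durbin_watson([alternating, persistent]))  # pooled across two units
```

Residuals that keep the same sign push the statistic toward 0, while residuals that alternate in sign push it toward 4.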

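As a rule of thumb, d is approximately 2(1-ρ), so a value near 2 suggests no serial correlation, while values toward 0 suggest positive serial correlation and values toward 4 suggest negative serial correlation. Tables of Durbin-Watson critical values tell you whether to reject the null hypothesis of no serial correlation. If there is serial correlation in your data, you may want to include a lagged dependent variable as one of your right-hand-side variables. This will result in an AR(1) specification.

Yuting Wang of Notre Dame has a good explanation of the problems that occur with serial correlation.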

